Aggregating JSON arrays and calculating set union size in MySQL

I have a use case where I need to calculate set overlaps over arbitrary time periods.
My data looks like this when loaded into pandas. In MySQL, the user_ids column is stored with the JSON data type.
I need to calculate the size of the union set when grouping by the date column. E.g., in the example below, if 2021-01-31 is grouped with 2021-02-28, then the result should be
In [1]: len(set([46, 44, 14] + [44, 7, 36]))
Out[1]: 5
Doing this in Python is trivial, but I'm struggling with how to do this in MySQL.
Aggregating the arrays into an array of arrays is easy:
SELECT
date,
JSON_ARRAYAGG(user_ids) as uids
FROM mytable
GROUP BY date
but after that I face two problems:
1. How to flatten the array of arrays into a single array
2. How to extract distinct values (i.e. convert the array into a set)
Any suggestions? Thank you!
PS. In my case I can probably get by with doing the flattening and set conversion on the client side, but I was pretty surprised at how difficult something simple like this turned out to be... :/

As mentioned in other comments, storing JSON arrays in your database is sub-optimal and should generally be avoided. That aside, it is actually easier to first explode the JSON array into rows (which also gives you the result you wanted from your second point):
SELECT mytable.date, jtable.VAL as user_id
FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable;
From here on out, we can group the dates again and recombine the user_ids into a JSON array with the JSON_ARRAYAGG function you already found:
SELECT mytable.date, JSON_ARRAYAGG(jtable.VAL) as user_ids
FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable
GROUP BY mytable.date;
You can try this out in this DB fiddle.
NOTE: this does require MySQL 8+ / MariaDB 10.6+.
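If what you ultimately need is the union size rather than the de-duplicated array itself, you can also count distinct values straight off the exploded rows. A minimal sketch for the example in the question (the two literal dates are just the sample grouping; adjust the WHERE clause or GROUP BY to your own period):
SELECT COUNT(DISTINCT jtable.VAL) AS union_size
FROM mytable,
     JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable
WHERE mytable.date IN ('2021-01-31', '2021-02-28');
-- 5 for the sample arrays [46, 44, 14] and [44, 7, 36]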

Thank you for the answers.
For anybody who's interested, the solution I ended up with was to store the data like this:
And then do the set calculations in pandas.
import pandas as pd

(
    df.groupby(pd.Grouper(key="date", freq="QS")).aggregate(
        num_unique_users=(
            "user_ids",
            # flatten the per-row lists of user ids and count the distinct values
            lambda uids: len({uid for ul in uids for uid in ul}),
        ),
    )
)
I was able to reduce a 20GiB table to around 300MiB, which is fast enough to query and retrieve data from.
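For reference, the same quarterly distinct-user count can also be written entirely in MySQL 8+ with the JSON_TABLE approach from the answer above; a minimal sketch (how it performs on a 20GiB table is another matter):
SELECT CONCAT(YEAR(date), '-Q', QUARTER(date)) AS quarter,
       COUNT(DISTINCT jt.VAL) AS num_unique_users
FROM mytable,
     JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jt
GROUP BY quarter;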

Related

Snowflake interpreting timestamp wrong?

I'm loading a bunch of semi-structured data (JSON) into my database through Snowflake. The timestamp values in the entries are javascript timestamps that look like this:
"time": 1621447619899
Snowflake automatically converts this into a timestamp variable that looks like this:
53351-08-15 22:04:10.000.
All good so far. However, I think that the new timestamp is wrong. The actual datetime should be May 19, 2021 around 12pm MDT. Am I reading it wrong? Is it dependent on the time zone that my Snowflake instance is in?
When comparing the following options manually in SQL:
with x as (
    SELECT parse_json('{"time": 1621447619899}') as var
)
SELECT var:time,
       var:time::number,
       var:time::varchar::timestamp,
       1621447619899::timestamp,
       '1621447619899'::timestamp,
       var:time::timestamp
FROM x;
It appears that what you want to do is execute the following:
var:time::varchar::timestamp
Reviewing the documentation, it does look like TO_TIMESTAMP wants the epoch value as a string, so you need to cast to varchar first and then cast to timestamp; otherwise you get the far-future result you are seeing.
The question says the Snowflake conversion to "53351-08-15 22:04:10.000" looks right ("All good so far"), but it doesn't look right to me.
When I try the input number in Snowflake I get this:
select '1621447619899'::timestamp;
-- 2021-05-19T18:06:59.899Z
That makes a lot more sense.
You'll need to provide more code or context for further debugging - but if you tell Snowflake to transform that number to a timestamp, you'll get the correct timestamp out.
See the rules that Snowflake uses here:
https://docs.snowflake.com/en/sql-reference/functions/to_timestamp.html#usage-notes
The ::timestamp cast handles string and numeric inputs differently: a string of this magnitude is interpreted as milliseconds since 1970-01-01 (correct), whereas a numeric value is interpreted as seconds, which returns a date way in the future: "53351-08-18 20:38:19.000".
SELECT TO_VARCHAR(1621447619899::timestamp) AS numeric_input
      ,'1621447619899'::timestamp           AS string_input;

-- numeric_input = 53351-08-18 20:38:19.000
-- string_input  = 2021-05-19 18:06:59.899
Solutions are to convert to a string or divide by 1000:
SELECT TO_TIMESTAMP(time::string)
SELECT TO_TIMESTAMP(time/1000)
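Putting that together for the original JSON value, a sketch (the optional scale argument of TO_TIMESTAMP, where 3 means milliseconds, is described in the docs linked above; the column aliases are just illustrative):
with x as (
    select parse_json('{"time": 1621447619899}') as var
)
select to_timestamp(var:time::varchar)   as via_string, -- magnitude rules treat 13 digits as milliseconds
       to_timestamp(var:time::number, 3) as via_scale   -- scale 3 = value is in milliseconds
from x;
-- both columns return 2021-05-19 18:06:59.899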

How to extract a time stamp from a string variable in mysql

I have the following data in a column b which is part of table x.
Table x Column b
{"op":"&","c":[{"type":"date","d":">=","t":1459756800}],"showc":[true]}
{"op":"&","showc":[true],"c":[{"type":"date","d":">=","t":1460534400}]}
I tried the query below to extract my data, but it does not work because the timestamps are in different positions.
SELECT substring(Column b, 44 , 10)
FROM Table x
How would I go about extracting just the timestamp?
Much appreciated.
This answer is for MySQL < 5.7; it seems 5.7 added native JSON support.
Native querying does not support JSON parsing, which leads to all kinds of trouble if you try to parse this column as a string. An example of such an issue is the timestamp value sitting at a different position in each row because of differing properties, string lengths, etc.
You need to add JSON parsing support either through a script (PHP, ...) or by augmenting MySQL's functionality.
I never got around to using it, but common-schema could help you out. I am sure there are other ways:
https://code.google.com/archive/p/common-schema/
Usage example from http://mechanics.flite.com/blog/2013/04/08/json-parsing-in-mysql-using-common-schema/:
mysql> select common_schema.extract_json_value(f.event_data,'/age') as age,
-> common_schema.extract_json_value(f.event_data,'/gender') as gender,
-> sum(f.event_count) as event_count
-> from json_event_fact f
-> group by age, gender;
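If you are on MySQL 5.7 or later (as mentioned above), the native JSON functions make this a one-liner; a sketch, assuming the table and column really are named x and b as in the question, and that t always sits in the first element of c as in the two sample rows:
SELECT JSON_EXTRACT(b, '$.c[0].t') AS unix_ts,
       FROM_UNIXTIME(CAST(JSON_EXTRACT(b, '$.c[0].t') AS UNSIGNED)) AS ts
FROM x;
-- FROM_UNIXTIME returns the timestamp in the session time zone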

Postgresql - Counting Elements of a 2D JSON array

I have a Postgresql JSON 2D array column containing string terms, e.g.:
Input
[["edwards", "block", "row"], ["edwards"], ["block"]]
Is it possible to compute the occurrence of each term purely in Postgresql? e.g.:
Output
Terms, Occur
["edwards", "block", "row"] [2,2,1]
(Or in some similar format). Or would I have to compute the occurrences using a programming language?
Here's one solution:
select array_to_json(array(
select json_array_length(a.value)
from json_array_elements('[["edwards", "block", "row"], ["edwards"], ["block"]]'::json) a
));
Figured this one out eventually. The query is as follows:
SELECT COUNT(t), t
FROM (SELECT jsonb_array_elements_text(json->'named_entities') AS t
FROM tweets WHERE event_id = XX) AS t1
GROUP BY t
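Applying the same idea directly to the 2D array from the question, a sketch (the jsonb literal stands in for your column):
SELECT t AS term, COUNT(*) AS occur
FROM (
    SELECT jsonb_array_elements_text(inner_arr) AS t
    FROM jsonb_array_elements(
             '[["edwards", "block", "row"], ["edwards"], ["block"]]'::jsonb
         ) AS inner_arr
) AS flattened
GROUP BY t;
-- edwards | 2, block | 2, row | 1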

PostgreSQL return result set as JSON array?

I would like to have PostgreSQL return the result of a query as one JSON array. Given
create table t (a int primary key, b text);
insert into t values (1, 'value1');
insert into t values (2, 'value2');
insert into t values (3, 'value3');
I would like something similar to
[{"a":1,"b":"value1"},{"a":2,"b":"value2"},{"a":3,"b":"value3"}]
or
{"a":[1,2,3], "b":["value1","value2","value3"]}
(actually it would be more useful to know both). I have tried some things like
select row_to_json(row) from (select * from t) row;
select array_agg(row) from (select * from t) row;
select array_to_string(array_agg(row), '') from (select * from t) row;
And I feel I am close, but not there really. Should I be looking at documentation other than 9.15. JSON Functions and Operators?
By the way, I am not sure about my idea. Is this a usual design decision? My thinking is that I could, of course, take the result (for example) of the first of the above 3 queries and manipulate it slightly in the application before serving it to the client, but if PostgreSQL can create the final JSON object directly, it would be simpler, because I still have not included any dependency on any JSON library in my application.
TL;DR
SELECT json_agg(t) FROM t
for a JSON array of objects, and
SELECT
json_build_object(
'a', json_agg(t.a),
'b', json_agg(t.b)
)
FROM t
for a JSON object of arrays.
List of objects
This section describes how to generate a JSON array of objects, with each row being converted to a single object. The result looks like this:
[{"a":1,"b":"value1"},{"a":2,"b":"value2"},{"a":3,"b":"value3"}]
9.3 and up
The json_agg function produces this result out of the box. It automatically figures out how to convert its input into JSON and aggregates it into an array.
SELECT json_agg(t) FROM t
The jsonb type was introduced in 9.4, but 9.4 has no jsonb version of json_agg (jsonb_agg arrived in 9.5), so to get a jsonb result you can either aggregate the rows into an array and then convert them:
SELECT to_jsonb(array_agg(t)) FROM t
or combine json_agg with a cast:
SELECT json_agg(t)::jsonb FROM t
My testing suggests that aggregating them into an array first is a little faster. I suspect that this is because the cast has to parse the entire JSON result.
9.2
9.2 does not have the json_agg or to_json functions, so you need to use the older array_to_json:
SELECT array_to_json(array_agg(t)) FROM t
You can optionally include a row_to_json call in the query:
SELECT array_to_json(array_agg(row_to_json(t))) FROM t
This converts each row to a JSON object, aggregates the JSON objects as an array, and then converts the array to a JSON array.
I wasn't able to discern any significant performance difference between the two.
Object of lists
This section describes how to generate a JSON object, with each key being a column in the table and each value being an array of the values of the column. It's the result that looks like this:
{"a":[1,2,3], "b":["value1","value2","value3"]}
9.5 and up
We can leverage the json_build_object function:
SELECT
json_build_object(
'a', json_agg(t.a),
'b', json_agg(t.b)
)
FROM t
You can also aggregate the columns, creating a single row, and then convert that into an object:
SELECT to_json(r)
FROM (
SELECT
json_agg(t.a) AS a,
json_agg(t.b) AS b
FROM t
) r
Note that aliasing the arrays is absolutely required to ensure that the object has the desired names.
Which one is clearer is a matter of opinion. If using the json_build_object function, I highly recommend putting one key/value pair on a line to improve readability.
You could also use array_agg in place of json_agg, but my testing indicates that json_agg is slightly faster.
If you need a jsonb result, one option is to aggregate into a single row and convert:
SELECT to_jsonb(r)
FROM (
SELECT
array_agg(t.a) AS a,
array_agg(t.b) AS b
FROM t
) r
Unlike the other queries for this kind of result, array_agg seems to be a little faster when using to_jsonb. I suspect this is due to the overhead of parsing and validating the JSON result of json_agg.
Or you can use an explicit cast:
SELECT
json_build_object(
'a', json_agg(t.a),
'b', json_agg(t.b)
)::jsonb
FROM t
The to_jsonb version allows you to avoid the cast and is faster, according to my testing; again, I suspect this is due to overhead of parsing and validating the result.
9.4 and 9.3
The json_build_object function was new to 9.5, so you have to aggregate and convert to an object in previous versions:
SELECT to_json(r)
FROM (
SELECT
json_agg(t.a) AS a,
json_agg(t.b) AS b
FROM t
) r
or
SELECT to_jsonb(r)
FROM (
SELECT
array_agg(t.a) AS a,
array_agg(t.b) AS b
FROM t
) r
depending on whether you want json or jsonb.
(9.3 does not have jsonb.)
9.2
In 9.2, not even to_json exists. You must use row_to_json:
SELECT row_to_json(r)
FROM (
SELECT
array_agg(t.a) AS a,
array_agg(t.b) AS b
FROM t
) r
Documentation
Find the documentation for the JSON functions on the JSON Functions and Operators page.
json_agg is on the aggregate functions page.
Design
If performance is important, ensure you benchmark your queries against your own schema and data, rather than trust my testing.
Whether it's a good design or not really depends on your specific application. In terms of maintainability, I don't see any particular problem. It simplifies your app code and means there's less to maintain in that portion of the app. If PG can give you exactly the result you need out of the box, the only reason I can think of to not use it would be performance considerations. Don't reinvent the wheel and all.
Nulls
Aggregate functions typically give back NULL when they operate over zero rows. If this is a possibility, you might want to use COALESCE to avoid them. A couple of examples:
SELECT COALESCE(json_agg(t), '[]'::json) FROM t
Or
SELECT to_jsonb(COALESCE(array_agg(t), ARRAY[]::t[])) FROM t
Credit to Hannes Landeholm for pointing this out.
Also, if you want selected fields from the table aggregated as an array of objects:
SELECT json_agg(json_build_object('data_a', a,
                                  'data_b', b)) FROM t;
The result will look like this:
[{"data_a":1,"data_b":"value1"},
 {"data_a":2,"data_b":"value2"},
 {"data_a":3,"data_b":"value3"}]

Left join table ON row with JSON values?

This one is tough. I have 2 tables that I need to join on a specific field, and the issue is that in the first table that field holds a JSON value.
This is the JSON value from the items table:
[{"id":"15","value":"News Title"},{"id":"47","value":"image1.jpg"},{"id":"33","value":"$30"}]
This is the attributes table that I need to join on the JSON id so I can get the actual attribute names, like Title, Image, Price:
id Name
15 Title
47 Image
33 Price
so the start is
SELECT item_values FROM ujm_items
LEFT JOIN?????
WHERE category = 8 AND published = 1 ORDER BY created DESC
but I have no clue how to do a LEFT JOIN on JSON.
Any help is appreciated.
... and this is why you don't store structured data in a single SQL field. It negates the whole purpose of a relational database.
Unless you've got a DB that includes a JSON parser, you've got two options:
a) unreliable string operations to find/extract a particular key/value pair
b) slurp the json into a client which CAN parse back to native, extract the key/values you want, then use some other ID field for the actual joins.
SELECT ...
LEFT JOIN ON SUBSTR(jsonfield, POSITION('"id:"', jsonfield)) etc...
Either way, it utterly torpedoes performance since you can't use indexes on these calculated/derived values.
note that this won't work as is - it's just to demonstrate how utterly ugly this gets.
Fix your tables - normalize the design and don't store JSON data that you need to extract data from. It's one thing to put in a JSON string that you'll only ever fetch/update in its entirety. It's a completely different thing to have one you need to join on sub-values thereof.
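That said, on MySQL 5.7+/8.0+ the native JSON support changes the picture: JSON_TABLE can explode the array so a regular join becomes possible. A sketch, assuming the table and column names from the question (ujm_items.item_values plus an attributes table with id and Name):
SELECT a.Name, jt.val
FROM ujm_items AS i
CROSS JOIN JSON_TABLE(
        i.item_values, '$[*]'
        COLUMNS (attr_id INT          PATH '$.id',   -- the "15", "47", "33" strings are coerced to INT
                 val     VARCHAR(255) PATH '$.value')
    ) AS jt
LEFT JOIN attributes AS a ON a.id = jt.attr_id
WHERE i.category = 8 AND i.published = 1
ORDER BY i.created DESC;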