Postgres - Performance of select for large jsonb column - json

We are using the Postgres jsonb type in one of our database tables. The table structure is shown below:
CREATE TABLE IF NOT EXISTS public.draft_document (
    id bigserial NOT NULL PRIMARY KEY,
    ...
    document jsonb NOT NULL,
    ein_search character varying(11) NOT NULL
);
CREATE INDEX IF NOT EXISTS count_draft_document_idx ON public.draft_document USING btree (ein_search);
CREATE INDEX IF NOT EXISTS read_draft_document_idx ON public.draft_document USING btree (id, ein_search);
The JSON structure of the document column may vary. Below is one example of a possible schema for document:
"withholdingCredit": {
"type": "array",
"items": {
"$ref": "#/definitions/withholding"
}
}
where each withholding array element conforms to the following definition:
"withholding": {
"properties": {
...
"proportionalityIndicator": {
"type": "boolean"
},
"tribute": {
"$ref": "#/definitions/tribute"
},
"payingSourceEin": {
"type": "string"
},
"value": {
"type": "number"
}
...
}
...
},
"tribute": {
"type": "object",
"properties": {
"code": {
"type": "number"
},
"additionalCode": {
"type": "number"
}
...
}
}
Here is an example of the JSON stored in the document jsonb column:
{
    "withholdingCredit": [
        {
            "value": 15000,
            "tribute": {
                "code": 1216,
                "additionalCode": 2
            },
            "payingSourceEin": "03985506123132",
            "proportionalityIndicator": false
        },
        ...
        {
            "value": 98150,
            "tribute": {
                "code": 3155,
                "additionalCode": 1
            },
            "payingSourceEin": "04185506123163",
            "proportionalityIndicator": false
        }
    ]
}
The number of elements in the array can vary, up to a maximum of 100,000 (one hundred thousand) elements. It is a business limit.
We need a paged select query that returns the withholding array disaggregated (one element per row), where each row also carries the sum of all the withholding values and the array length.
The query also needs to return the withholdings ordered by proportionalityIndicator, tribute->code, tribute->additionalCode, payingSourceEin. Something like:
id    | sum       | jsonb_array_length | jsonb_array_elements
------+-----------+--------------------+--------------------------------------------------
30900 | 1.800.027 | 2300               | {"value":15000,"tribute":{"code":1216,...}, ...}
...   | ...       | ...                | { ... }
30900 | 1.800.027 | 2300               | {"value":98150,"tribute":{"code":3155,...}, ...}
We have defined the following query:
SELECT dft.id,
       SUM((elem->>'value')::NUMERIC),
       jsonb_array_length(dft.document->'withholdingCredit'),
       jsonb_array_elements(jsonb_agg(elem
           ORDER BY elem->>'proportionalityIndicator',
                    (elem->'tribute'->>'code')::NUMERIC,
                    (elem->'tribute'->>'additionalCode')::NUMERIC,
                    elem->>'payingSourceEin'))
FROM draft_document dft
CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
  AND dft.id = :id
  AND dft.ein_search = :ein_search
GROUP BY dft.id
LIMIT :limit OFFSET :offset;
This query works, but with performance limitations when the jsonb array holds a large number of elements.
Any suggestion on how to improve it is welcome.
BTW, we are using Postgres 9.6.

Your weird query which breaks it apart, aggregates it, and breaks it apart again does seem to trigger some pathological memory management issue in PostgreSQL (tested on 15dev). Maybe you should file a bug report on that.
But you can avoid the problem by just breaking it apart one time. Then you need window functions so that the tabulations you want (the sum and the count) include all rows, even those removed by the OFFSET and LIMIT.
SELECT dft.id,
       SUM((elem->>'value')::NUMERIC) OVER (),
       count(*) OVER (),
       elem
FROM draft_document dft
CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
  AND dft.id = 4
  AND dft.ein_search = '4'
ORDER BY elem->>'proportionalityIndicator',
         (elem->'tribute'->>'code')::NUMERIC,
         (elem->'tribute'->>'additionalCode')::NUMERIC,
         elem->>'payingSourceEin'
LIMIT 4 OFFSET 500;
In my hands this gives the same answer as your query, but takes 370 ms rather than 13,789 ms.
At higher offsets than that, my query still works while yours leads to a total lock up requiring a hard reset.
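Put back into the parameterized shape of the original question (same logic, just with the bind placeholders and column aliases restored; treat it as a sketch, it was not run here):
SELECT dft.id,
       SUM((elem->>'value')::NUMERIC) OVER () AS value_sum,
       count(*) OVER ()                       AS array_length,
       elem
FROM draft_document dft
CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
  AND dft.id = :id
  AND dft.ein_search = :ein_search
ORDER BY elem->>'proportionalityIndicator',
         (elem->'tribute'->>'code')::NUMERIC,
         (elem->'tribute'->>'additionalCode')::NUMERIC,
         elem->>'payingSourceEin'
LIMIT :limit OFFSET :offset;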
If anyone wants to reproduce the poor behavior, I generated the data by:
insert into draft_document
select 4,
       jsonb_build_object('withholdingCredit',
                          jsonb_agg(jsonb_build_object(
                              'value', floor(random()*99999)::int,
                              'tribute', '{"code": 1216, "additionalCode": 2}'::jsonb,
                              'payingSourceEin', floor(random()*99999999)::int,
                              'proportionalityIndicator', false))),
       '4'
from generate_series(1,100000)
group by 1,3;
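To sanity-check the generated row before comparing the two queries, something like this (using the id = 4 / ein_search = '4' values from the insert above) should report 100000 elements:
SELECT id,
       ein_search,
       jsonb_array_length(document->'withholdingCredit') AS array_length
FROM draft_document
WHERE id = 4
  AND ein_search = '4';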

Related

Update JSON Array in Postgres with specific key

I have a complex array which looks like the following in a table column:
{
    "sometag": {},
    "where": [
        {
            "id": "Krishna",
            "nick": "KK",
            "values": [
                "0"
            ],
            "function": "ADD",
            "numValue": [
                "0"
            ]
        },
        {
            "id": "Krishna1",
            "nick": "KK1",
            "values": [
                "0"
            ],
            "function": "SUB",
            "numValue": [
                "0"
            ]
        }
    ],
    "anotherTag": [],
    "TagTag": {
        "tt": "tttttt",
        "tt1": "tttttt"
    }
}
In this array, I want to update the function and numValue of the element with id "Krishna".
Kindly help.
This is really nasty because:
Updating an element inside a JSON array always requires expanding the whole array
On top of that, the array is nested
The identifier for the elements to update is a sibling, not a parent, which means you have to filter by a sibling
So I came up with a solution, but I want to add a disclaimer: you should avoid doing this as a regular database action! Better would be:
Parsing your JSON in the backend and doing the operations in your backend code
Normalizing the JSON in your database if this is a common task, meaning: create tables with appropriate columns and extract your JSON into that table structure (a rough sketch follows below). Do not store entire JSON objects in the database! That would make every single task much easier and incredibly more performant!
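Just to illustrate the normalization point, here is what such a layout could look like for this particular JSON (table and column names are made up; values and numValue are renamed only to avoid reserved words):
CREATE TABLE where_entry (
    id         text PRIMARY KEY,   -- e.g. 'Krishna'
    nick       text,
    func       text,               -- "function" in the JSON ('ADD', 'SUB', ...)
    value_list text[],             -- "values" in the JSON
    num_value  text[]              -- "numValue" in the JSON
);

-- the original request then becomes a trivial, indexable UPDATE:
UPDATE where_entry
SET func      = 'ADDITION',
    num_value = ARRAY['0','1']
WHERE id = 'Krishna';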
demo:db<>fiddle
SELECT
    jsonb_set(                                                              -- 5
        (SELECT mydata::jsonb FROM mytable),
        '{where}',
        updated_array
    )::json
FROM (
    SELECT
        jsonb_agg(                                                          -- 4
            CASE WHEN array_elem ->> 'id' = 'Krishna' THEN
                jsonb_set(                                                  -- 3
                    jsonb_set(array_elem.value::jsonb, '{function}', '"ADDITION"'::jsonb),  -- 2
                    '{numValue}',
                    '["0","1"]'::jsonb
                )
            ELSE array_elem::jsonb END
        ) as updated_array
    FROM mytable,
         json_array_elements(mydata -> 'where') array_elem                  -- 1
) s
1. Extract the nested array elements into one element per row
2. Replace the function value. Note the casts from type json to type jsonb. That is necessary because there is no json_set() function, only jsonb_set(). Naturally, if you just have type jsonb, the casts are not necessary.
3. Replace the numValue value
4. Reaggregate the array
5. Replace the where value of the original JSON object with the newly created array object.
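If the goal is to persist the change rather than just select the rewritten document, the same expression can be moved into an UPDATE. A sketch, assuming the mytable/mydata names from the fiddle, a json column (drop the casts for jsonb), and that every row has a non-empty where array:
UPDATE mytable
SET mydata = jsonb_set(
        mydata::jsonb,
        '{where}',
        (SELECT jsonb_agg(CASE WHEN elem ->> 'id' = 'Krishna' THEN
                                   jsonb_set(
                                       jsonb_set(elem, '{function}', '"ADDITION"'::jsonb),
                                       '{numValue}',
                                       '["0","1"]'::jsonb)
                               ELSE elem END)
         FROM jsonb_array_elements(mydata::jsonb -> 'where') AS t(elem))
    )::json;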

How can Postgres extract parts of json, including arrays, into another JSON field?

I'm trying to convince PostgreSQL 13 to pull out parts of a JSON field into another field, including a subset of properties within an array based on a discriminator (type) property. For example, given a data field containing:
{
    "id": 1,
    "type": "a",
    "items": [
        { "size": "small", "color": "green" },
        { "size": "large", "color": "white" }
    ]
}
I'm trying to generate new_data like this:
{
    "items": [
        { "size": "small" },
        { "size": "large" }
    ]
}
items can contain any number of entries. I've tried variations of SQL something like:
UPDATE my_table
SET new_data = (
CASE data->>'type'
WHEN 'a' THEN
json_build_object(
'items', json_agg(json_array_elements(data->'items') - 'color')
)
ELSE
null
END
);
but I can't seem to get it working. In this case, I get:
ERROR: set-returning functions are not allowed in UPDATE
LINE 6: 'items', json_agg(json_array_elements(data->'items')...
I can get a set of items using json_array_elements(data->'items') and thought I could roll this up into a JSON array using json_agg and remove unwanted keys using the - operator. But now I'm not sure if what I'm trying to do is possible. I'm guessing it's a case of PEBCAK. I've got about a dozen different types each with slightly different rules for how new_data should look, which is why I'm trying to fit the value for new_data into a type-based CASE statement.
Any tips, hints, or suggestions would be greatly appreciated.
One way is to handle the set json_array_elements() returns in a subquery.
UPDATE my_table
SET new_data = CASE
WHEN data->>'type' = 'a' THEN
(SELECT json_build_object('items',
json_agg(jae.item::jsonb - 'color'))
FROM json_array_elements(data->'items') jae(item))
END;
db<>fiddle
Also note that - isn't defined for json, only for jsonb. So unless your columns are actually jsonb, you need a cast. And you don't need an explicit ... ELSE NULL ... in a CASE expression; NULL is already the default value if no ELSE branch is specified.
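Since the question mentions about a dozen types, each with slightly different rules, additional WHEN branches slot into the same CASE in the same way. For example (the 'b' type and the key it drops are made up here purely for illustration):
UPDATE my_table
SET new_data = CASE
                 WHEN data->>'type' = 'a' THEN
                   (SELECT json_build_object('items',
                                             json_agg(jae.item::jsonb - 'color'))
                    FROM json_array_elements(data->'items') jae(item))
                 WHEN data->>'type' = 'b' THEN   -- hypothetical second type: drop "size" instead
                   (SELECT json_build_object('items',
                                             json_agg(jae.item::jsonb - 'size'))
                    FROM json_array_elements(data->'items') jae(item))
               END;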

Filtering on JSON internal keys stored in PostgreSQL table

I have a report in JSON format stored in a field in a PostgreSQL database table.
Say the (simplified) table format is:
Column | Type
-------------------+----------------------------
id | integer
element_id | character varying(256)
report | json
and the structure of the data in the reports is like this
{
    "section1": {
        "test1": {
            "outcome": "nominal",
            "results": {
                "value1": 34.0,
                "value2": 56.0
            }
        },
        "test2": {
            "outcome": "warning",
            "results": {
                "avg": 4.5,
                "std": 21.0
            }
        }
    },
    ...
    "sectionN": {
        ...
    }
}
That is, there are N keys at the first level (the sections), each of them being an object with a set of keys (the tests), each test having an outcome and a variable set of results in the form of (key, value) pairs.
I need to do filtering based on internal JSON keys. More specifically, in this example, I want to know if it is possible, using SQL alone, to obtain the elements that have, for example, the std value in their results above a certain threshold, say 10. I may know that std lives in test2, but I do not know a priori in which section. With this filter (test2.std > 10), for example, the record with the sample data shown above would be returned, since the std value in test2 is 21 (> 10).
Another, simpler, filter could be to request all the records for which the test2.outcome is not nominal.
One way is jsonb_each, like:
select section.key
, test.key
from t1
cross join
jsonb_each(t1.col1) section
cross join
jsonb_each(section.value) test
where (test.value->'results'->>'std')::numeric > 10
Example at SQL Fiddle.
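For the second, simpler filter (all records whose test2.outcome is not nominal), the same expansion works. A sketch against the question's own columns (the table name reports is assumed, and report is cast because the column is json rather than jsonb):
SELECT DISTINCT r.id, r.element_id
FROM reports r
CROSS JOIN jsonb_each(r.report::jsonb) section             -- one row per first-level section
WHERE section.value -> 'test2' ->> 'outcome' <> 'nominal'; -- sections without test2 yield NULL and drop out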

N1QL nested json, query on field inside object inside array

I have json documents in my Couchbase cluster that looks like this
{
    "giata_properties": {
        "propertyCodes": {
            "provider": [
                {
                    "code": [
                        {
                            "value": [
                                {
                                    "name": "Country Code",
                                    "value": "EG"
                                },
                                {
                                    "name": "City Code",
                                    "value": "HRG"
                                },
                                {
                                    "name": "Hotel Code",
                                    "value": "91U"
                                }
                            ]
                        }
                    ],
                    "providerCode": "gta",
                    "providerType": "gds"
                },
                {
                    "code": [
                        {
                            "value": [
                                {
                                    "value": "071801"
                                }
                            ]
                        },
                        {
                            "value": [
                                {
                                    "value": "766344"
                                }
                            ]
                        }
                    ],
                    "providerCode": "restel",
                    "providerType": "gds"
                },
                {
                    "code": [
                        {
                            "value": [
                                {
                                    "value": "HRG03Z"
                                }
                            ]
                        },
                        {
                            "value": [
                                {
                                    "value": "HRG04Z"
                                }
                            ]
                        }
                    ],
                    "providerCode": "5VF",
                    "providerType": "tourOperator"
                }
            ]
        }
    }
}
I'm trying to create a query that fetches a single document based on the value of giata_properties.propertyCodes.provider.code.value.value and a specific providerType.
So for example, if my input is 071801 and restel, I want a query that will fetch me the document I pasted above (because it contains these values).
I'm pretty new to N1QL, so here is what I tried so far (without the providerType input):
SELECT * FROM giata_properties AS gp
WHERE ANY `field` IN `gp.propertyCodes.provider.code.value` SATISFIES `field.value` = '071801' END;
This returns me an empty result set. I'm probably doing all of this wrongly.
edit1:
Following geraldss's answer, I was able to achieve my goal via 2 different queries:
1st (More general) ~2m50.9903732s
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp SATISFIES v.`value` = '071801' END;
2nd (More specific) ~2m31.3660388s
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp.propertyCodes.provider[*].code SATISFIES v.`value` = '071801' END;
The bucket has around 550K documents. Currently there are no indexes but the primary.
Question part 2
When I do either of the above queries, I get a result streamed to my shell very quickly, then I spend the rest of the query time waiting for the engine to finish iterating over all documents. I'm sure that I'll only be getting 1 result from future queries, so I thought I could use LIMIT 1 so the engine stops searching at the first result. I tried something like:
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp SATISFIES v.`value` = '071801' END LIMIT 1;
But that made no difference: I get a document written to my shell and then keep waiting until the query finishes completely. How can this be configured correctly?
edit2:
I've upgraded to the latest enterprise 4.5.1-2844. I have only the primary index created on the giata_properties bucket; when I execute the query along with the LIMIT 1 keyword it still takes the same time, it doesn't stop any quicker.
I've also tried creating the array index you suggested, but the query is not using the index and keeps insisting on the #primary index (even if I use the USE INDEX clause).
I tried removing SELF from the index you suggested; it took much longer to build, and now the query can use this new index, but I'm honestly not sure what I'm doing here.
So 3 questions:
1) Why doesn't LIMIT 1 using the primary index make the query stop at the first result?
2) What's the difference between the index you suggested with and without SELF? I tried to look for SELF keyword documentation but I couldn't find anything.
This is how both indexes look in the Web UI:
Index 1 (Your original suggestion) - Not working
CREATE INDEX `gp_idx1` ON `giata_properties`((distinct (array (`v`.`value`) for `v` within (array_star((((self.`giata_properties`).`propertyCodes`).`provider`)).`code`) end)))
Index 2 (Without SELF)
CREATE INDEX `gp_idx2` ON `giata_properties`((distinct (array (`v`.`value`) for `v` within (array_star(((self.`propertyCodes`).`provider`)).`code`) end)))
3) What would be the query for a specific giata_properties.propertyCodes.provider.code.value.value and a specific providerCode? I managed to do both separately but I wasn't successful in merging them.
Thanks for all your help dear
Here is a query without the providerType.
EXPLAIN SELECT *
FROM giata_properties AS gp
WHERE ANY v WITHIN gp.giata_properties.propertyCodes.provider[*].code SATISFIES v.`value` = '071801' END;
You can also index this in Couchbase 4.5.0 and above.
CREATE INDEX idx1 ON giata_properties( DISTINCT ARRAY v.`value` FOR v WITHIN SELF.giata_properties.propertyCodes.provider[*].code END );
Edit to answer question edits
The performance has been addressed in 4.5.x. You should try the following on Couchbase 4.5.1 and post the execution times here.
Test on 4.5.1.
Create the index.
Use the LIMIT. In 4.5.1, the limit is pushed down to the index.

Improving the performance of aggregating the value of a key spread across multiple JSON rows

I'm currently storing the data in the following format (JSON) in a Redis ZSET. The score is the timestamp in milliseconds.
<timestamp_1> - [ { "key1" : 200 }, { "key2": 100 }, {"key3" : 5 }, .... {"key_n" : 1} ]
<timestamp_2> - [ { "key50" : 500 }, { "key2": 300 }, {"key3" : 290 }, ....{"key_m" : 26} ]
....
....
<timestamp_k> - [ { "key1" : 100 }, { "key2": 200 }, {"key3" : 50 }, ....{"key_p" : 150} ]
I want to extract the values for a key between a given time range.
For example, the values of key2 in the above example for the entire time range would be:
[timestamp_1:100, timestamp_2:300, ..... timestamp_k:200]
I can get the desired output, but I have to parse the JSON for each row and then iterate through it to get the value of a given key in each row. The parsing becomes a bottleneck as the size of each row increases (n, m, and p can be as big as 10000).
I'm looking for suggestions on whether there is a way to improve the performance in Redis. Are there any specific parsers (in Scala) that can help here?
I'm also open to using other stores such as Cassandra and Elasticsearch if they give better performance. I'm also open to other formats apart from JSON to store the data in Redis ZSet.
Cassandra will work just fine for your requirement.
You can keep key_id as the partition key and timestamp as the clustering key.
In Cassandra you always define your query before designing your column family; here the query is: extract the values for a key between a given time range.
If you are using CQL3,
Create schema:
CREATE TABLE imp_keys (key_id text, score int, timestamp timeuuid,PRIMARY KEY(key_id,timestamp));
Access data:
SELECT score FROM imp_keys WHERE key_id='key2' AND timestamp > minTimeuuid(start_date) AND timestamp < maxTimeuuid(end_date);
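For completeness, writes into that table would then look something like this (a sketch; in practice you would generate the timeuuid from the original millisecond timestamp on the client instead of using now()):
-- one row per (key, timestamp) pair, taken from the first JSON row of the example
INSERT INTO imp_keys (key_id, timestamp, score) VALUES ('key1', now(), 200);
INSERT INTO imp_keys (key_id, timestamp, score) VALUES ('key2', now(), 100);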