Want to use inner element sum in case condition in N1QL couchbase

I want to run a query like the one below:
SELECT round(sum(ARRAY_SUM(case when ANY x IN transactions SATISFIES
x.type in [0,4] then transactions[*].amount else 0 end)),2)
total_income, _type FROM mybucket WHERE _type='Company'
I have multiple JSON documents like the one below:
{
  "_type": "Company",
  "created": "2015-12-01T18:30:00.000Z",
  "transactions": [
    {
      "amount": "96.5",
      "date": "2016-01-03T18:30:00.000Z",
      "type": 0
    },
    {
      "amount": "483.7",
      "date": "2016-01-10T18:30:00.000Z",
      "type": 0
    }
  ]
}
I want the sum of transactions->amount for transactions whose type is in [0,4], and I want it inside a CASE condition. How can I do it?

SELECT CASE WHEN array_count(a) > 0 THEN ARRAY_SUM(a) ELSE 0 END
FROM default
LET a = ARRAY TONUMBER(x.amount) FOR x IN transactions WHEN x.type IN [0,4] END
WHERE _type = "Company";
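If you want a single rounded total across all the Company documents, as in the original query, a minimal sketch along the same lines (assuming the bucket name mybucket from the question):
SELECT ROUND(SUM(ARRAY_SUM(ARRAY TONUMBER(x.amount) FOR x IN transactions WHEN x.type IN [0,4] END)), 2) AS total_income
FROM mybucket
WHERE _type = "Company";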

BTW, you can write this query using different language constructs in N1QL; check out UNNEST and array indexing. Especially when you have filters on array elements (such as transactions[*].type), you can leverage array indexing for better performance and push the filtering down to the WHERE clause (and indexes).
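For example, a sketch of the UNNEST form with a matching array index; the index name idx_tx_type is illustrative, not from the original answer:
CREATE INDEX idx_tx_type /* illustrative name */ ON mybucket(DISTINCT ARRAY x.type FOR x IN transactions END) WHERE _type = "Company";
SELECT ROUND(SUM(TONUMBER(x.amount)), 2) AS total_income
FROM mybucket
UNNEST transactions AS x
WHERE _type = "Company" AND x.type IN [0, 4];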

Related

Postgres - Performance of select for large jsonb column

We are using the Postgres jsonb type in one of our database tables. The table structure is shown below:
CREATE TABLE IF NOT EXISTS public.draft_document (
id bigserial NOT NULL PRIMARY KEY,
...
document jsonb NOT NULL,
ein_search character varying(11) NOT NULL
);
CREATE INDEX IF NOT EXISTS count_draft_document_idx ON public.draft_document USING btree (ein_search);
CREATE INDEX IF NOT EXISTS read_draft_document_idx ON public.draft_document USING btree (id, ein_search);
The json structure of document column may vary. Below is one example of a possible schema for document:
"withholdingCredit": {
"type": "array",
"items": {
"$ref": "#/definitions/withholding"
}
}
Where the withholding structure (array elements) respects:
"withholding": {
"properties": {
...
"proportionalityIndicator": {
"type": "boolean"
},
"tribute": {
"$ref": "#/definitions/tribute"
},
"payingSourceEin": {
"type": "string"
},
"value": {
"type": "number"
}
...
}
...
},
"tribute": {
"type": "object",
"properties": {
"code": {
"type": "number"
},
"additionalCode": {
"type": "number"
}
...
}
}
Here is an example of the json into document jsonb column:
{
  "withholdingCredit": [
    {
      "value": 15000,
      "tribute": {
        "code": 1216,
        "additionalCode": 2
      },
      "payingSourceEin": "03985506123132",
      "proportionalityIndicator": false
    },
    ...
    {
      "value": 98150,
      "tribute": {
        "code": 3155,
        "additionalCode": 1
      },
      "payingSourceEin": "04185506123163",
      "proportionalityIndicator": false
    }
  ]
}
The number of elements in the array can vary up to a maximum of 100,000 (one hundred thousand) elements. It is a business limit.
We need a paged select query that returns the withholding array disaggregated (1 element per row), where each row also carries the sum of the withholding elements' value and the array length.
The query also needs to return the withholdings ordered by proportionalityIndicator, tribute->code, tribute->additionalCode, payingSourceEin. Something like:
id    | sum       | jsonb_array_length | jsonb_array_elements
------|-----------|--------------------|-------------------------------------------------
30900 | 1.800.027 | 2300               | {"value":15000,"tribute":{"code":1216,...}, ...}
...   | ...       | ...                | { ... }
30900 | 1.800.027 | 2300               | {"value":98150,"tribute":{"code":3155,...}, ...}
We have defined the following query:
SELECT dft.id,
SUM((elem->>'value')::NUMERIC),
jsonb_array_length(dft.document->'withholdingCredit'),
jsonb_array_elements(jsonb_agg(elem
ORDER BY
elem->>'proportionalityIndicator',
(elem->'tribute'->>'code')::NUMERIC,
(elem->'tribute'->>'additionalCode')::NUMERIC,
elem->>'payingSourceEin'))
FROM
draft_document dft
CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
AND dft.id = :id
AND dft.ein_search = :ein_search
GROUP BY dft.id
LIMIT :limit OFFSET :offset;
This query works, but with performance limitations when we have a large number of elements in the jsonb array.
Any suggestion on how to improve it is welcome.
BTW, we are using Postgres 9.6.
Your weird query, which breaks the array apart, aggregates it, and breaks it apart again, does seem to trigger some pathological memory management issue in PostgreSQL (tested on 15dev). Maybe you should file a bug report on that.
But you can avoid the problem by just breaking it apart one time. Then you need window functions to get the tabulations you want, so that they include all rows, even those removed by the offset and limit.
SELECT dft.id,
SUM((elem->>'value')::NUMERIC) over (),
count(*) over (),
elem
FROM
draft_document dft
CROSS JOIN LATERAL jsonb_array_elements(dft.document->'withholdingCredit') arr(elem)
WHERE (dft.document->'withholdingCredit') IS NOT NULL
AND dft.id = 4
AND dft.ein_search = '4'
ORDER BY
elem->>'proportionalityIndicator',
(elem->'tribute'->>'code')::NUMERIC,
(elem->'tribute'->>'additionalCode')::NUMERIC,
elem->>'payingSourceEin'
LIMIT 4 OFFSET 500;
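The window functions (SUM(...) OVER (), count(*) OVER ()) are evaluated over the full set of unnested rows before LIMIT and OFFSET are applied, which is why every returned row still carries the whole-array sum and count.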
In my hands this gives the same answer as your query, but takes 370 ms rather than 13,789 ms.
At higher offsets than that, my query still works while yours leads to a total lock up requiring a hard reset.
If anyone wants to reproduce the poor behavior, I generated the data by:
insert into draft_document
select 4,
       jsonb_build_object('withholdingCredit',
           jsonb_agg(jsonb_build_object(
               'value', floor(random()*99999)::int,
               'tribute', '{"code": 1216, "additionalCode": 2}'::jsonb,
               'payingSourceEin', floor(random()*99999999)::int,
               'proportionalityIndicator', false))),
       '4'
from generate_series(1,100000)
group by 1, 3;

How can Postgres extract parts of json, including arrays, into another JSON field?

I'm trying to convince PostgreSQL 13 to pull out parts of a JSON field into another field, including a subset of properties within an array based on a discriminator (type) property. For example, given a data field containing:
{
  "id": 1,
  "type": "a",
  "items": [
    { "size": "small", "color": "green" },
    { "size": "large", "color": "white" }
  ]
}
I'm trying to generate new_data like this:
{
  "items": [
    { "size": "small" },
    { "size": "large" }
  ]
}
items can contain any number of entries. I've tried variations of SQL like:
UPDATE my_table
SET new_data = (
  CASE data->>'type'
    WHEN 'a' THEN
      json_build_object(
        'items', json_agg(json_array_elements(data->'items') - 'color')
      )
    ELSE
      null
  END
);
but I can't seem to get it working. In this case, I get:
ERROR: set-returning functions are not allowed in UPDATE
LINE 6: 'items', json_agg(json_array_elements(data->'items')...
I can get a set of items using json_array_elements(data->'items') and thought I could roll this up into a JSON array using json_agg and remove unwanted keys using the - operator. But now I'm not sure if what I'm trying to do is possible. I'm guessing it's a case of PEBCAK. I've got about a dozen different types each with slightly different rules for how new_data should look, which is why I'm trying to fit the value for new_data into a type-based CASE statement.
Any tips, hints, or suggestions would be greatly appreciated.
One way is to handle the set json_array_elements() returns in a subquery.
UPDATE my_table
SET new_data = CASE
WHEN data->>'type' = 'a' THEN
(SELECT json_build_object('items',
json_agg(jae.item::jsonb - 'color'))
FROM json_array_elements(data->'items') jae(item))
END;
db<>fiddle
Also note that - isn't defined for json, only for jsonb. So unless your column is actually jsonb, you need a cast. And you don't need an explicit ... ELSE NULL ... in a CASE expression; NULL is already the default value when no ELSE branch is specified.
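Since you mention having about a dozen types, the same pattern extends with one WHEN arm per type. A sketch; the type 'b' and the key it drops are hypothetical:
UPDATE my_table
SET new_data = CASE
  WHEN data->>'type' = 'a' THEN
    (SELECT json_build_object('items', json_agg(jae.item::jsonb - 'color'))
     FROM json_array_elements(data->'items') jae(item))
  WHEN data->>'type' = 'b' THEN -- hypothetical type with its own rule
    (SELECT json_build_object('items', json_agg(jae.item::jsonb - 'size'))
     FROM json_array_elements(data->'items') jae(item))
END;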

How to query deep nested json value from couchbase?

I have the following documents in the couchbase bucket. I need to query appversion > 3.2.1 OR appversion < 3.3.0 OR appversion = 3.4.1. How can I query these values from the nested JSON?
My JSON documents:
Document 1:
com.whatsapp_1
{
  "doc-type": "App-Metadata",
  "bundleid": "com.whatsapp",
  "value": {
    "appId": "com.whatsapp",
    "appName": "WhatsApp Messenger",
    "primaryCategoryName": "Communication"
  }
}
Document 2:
com.whatsapp_2
{
  "doc-type": "App-Lookalike",
  "bundleid": "com.whatsapp",
  "value": {
    "com.facebook.orca": 476664,
    "org.telegram.messenger.erick.lite": 423132,
    "com.viber.voip": 286410,
    "messenger.free.video.call.chat": 232830,
    "com.facebook.katana": 223000,
    "com.wChatMessenger_6210995": 219960,
    "com.facebook.talk": 187884
  }
}
Document 3:
com.whatsapp_3
{
  "doc-type": "Internal-Metadata",
  "bundleid": "com.whatsapp",
  "value": {
    "appversion": "3.4.1"
  }
}
value is a reserved keyword; you need to use backticks around it.
SELECT *
FROM sampleBucket
WHERE `doc-type` = 'Internal-Metadata' AND
      (`value`.appversion > "3.2.1" OR
       `value`.appversion < "3.3.0" OR
       `value`.appversion = "3.4.1");
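Note that these are plain string comparisons, so they order versions correctly only while every component is a single digit; for example, "3.10.1" would sort before "3.2.1".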
To query nested entities you should use the UNNEST keyword:
https://dzone.com/articles/nesting-and-unnesting-in-couchbase-n1ql
In your case, it will be something similar to:
select t.* from mybucket t UNNEST t.`value` v where t.`doc-type` = 'Internal-Metadata' and v.appversion = '3.2.1'
As your app versions are strings, you should use the REPLACE function to remove "." and then convert the result to a number before the comparison:
https://docs.couchbase.com/server/5.5/n1ql/n1ql-language-reference/stringfun.html#fn-str-replace
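A sketch of that suggestion, assuming the versions always have the single-digit x.y.z shape, so that "3.4.1" becomes 341:
SELECT *
FROM sampleBucket
WHERE `doc-type` = 'Internal-Metadata'
  AND TONUMBER(REPLACE(`value`.appversion, ".", "")) = 341;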
I'm not quite sure what you want, but if you want a query that only returns document 3, this query should do it.
SELECT *
FROM sampleBucket
WHERE `value`.appversion > "3.2.1" OR `value`.appversion < "3.3.0" OR `value`.appversion = "3.4.1"
This should return only the third document. The query also assumes all app versions are of the form x.y.z where x, y, and z are single-digit numbers.
If that's not the result you are looking for, please explain more precisely what you want.

N1QL Distinct Query on Nested Arrays

(Couchbase 4.5) Suppose I have the following object stored in my couchbase instance:
{
  parentArray: [
    {
      childArray: [{value: 'v1'}, {value: 'v2'}]
    },
    {
      childArray: [{value: 'v1'}, {value: 'v3'}]
    }
  ]
}
Now I want to select the distinct elements from childArray, which should return an array equal to ['v1', 'v2', 'v3'].
I have a couple solutions to this. My first thought was to go ahead and use the UNNEST operation:
SELECT DISTINCT ca.value FROM `my-bucket` AS b UNNEST b.parentArray AS pa UNNEST pa.childArray AS ca WHERE _class="someclass" AND dataType="someDataType";
With this approach I get a polynomial explosion in the number of scanned elements (due to the unnest'ing of two arrays), and the query takes a bit of time to complete (for my real data on the order of 24 seconds). When I remove unnest, and simply query for distinct elements on the top-level elements (those adjacent to parentArray), it takes on the order of milliseconds.
Another solution is to handle this in the application code, by simply iterating through the returned values and finding the distinct values my-self. This approach is bad, because it brings too much data into the application space.
Any help please!
Thank you!
UPDATE: It looks like without a WHERE clause, the query using the UNNEST statements is fast. So do I need array indexes here?
UPDATE: Never mind the previous update, since there are no indexed elements in the WHERE clause. I do notice that if I remove the UNNEST or the WHERE, then the query is fast. Moreover, looking at the EXPLAIN after adding a GSI compound index on (_class, dataType), I can see an IndexScan on the provided index.
INSERT INTO default values("3",{ "parentArray" : [ { "childArray": [{"value": 'v1'}, {"value":'v2'}] }, { "childArray": [{"value": 'v1'}, {"value": 'v3'}] } ] });
SELECT ARRAY_DISTINCT(ARRAY v.`value` FOR v WITHIN parentArray END) FROM default;
OR
SELECT ARRAY_DISTINCT(ARRAY_FLATTEN(
ARRAY ARRAY v.`value` FOR v IN ca.childArray END FOR ca IN parentArray END,
2)) FROM default;
You can add a WHERE clause. If you need this across documents, use the following.
INSERT INTO default values("4",{ "parentArray" : [ { "childArray": [{"value": 'v5'}, {"value":'v2'}] }, { "childArray": [{"value": 'v1'}, {"value": 'v3'}] } ] });
SELECT ARRAY_DISTINCT(ARRAY_FLATTEN(ARRAY_AGG(ARRAY v.`value` FOR v WITHIN parentArray END),2)) FROM default;
SELECT ARRAY_DISTINCT(ARRAY_FLATTEN(ARRAY_AGG(ARRAY_FLATTEN(ARRAY ARRAY v.`value` FOR v IN ca.childArray END FOR ca IN parentArray END,2)),2)) FROM default;
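Against the two inserted sample documents, either cross-document query returns a single row whose array is ["v1", "v2", "v3", "v5"] (element order is not guaranteed).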

N1QL nested json, query on field inside object inside array

I have JSON documents in my Couchbase cluster that look like this:
{
  "giata_properties": {
    "propertyCodes": {
      "provider": [
        {
          "code": [
            {
              "value": [
                {
                  "name": "Country Code",
                  "value": "EG"
                },
                {
                  "name": "City Code",
                  "value": "HRG"
                },
                {
                  "name": "Hotel Code",
                  "value": "91U"
                }
              ]
            }
          ],
          "providerCode": "gta",
          "providerType": "gds"
        },
        {
          "code": [
            {
              "value": [
                {
                  "value": "071801"
                }
              ]
            },
            {
              "value": [
                {
                  "value": "766344"
                }
              ]
            }
          ],
          "providerCode": "restel",
          "providerType": "gds"
        },
        {
          "code": [
            {
              "value": [
                {
                  "value": "HRG03Z"
                }
              ]
            },
            {
              "value": [
                {
                  "value": "HRG04Z"
                }
              ]
            }
          ],
          "providerCode": "5VF",
          "providerType": "tourOperator"
        }
      ]
    }
  }
}
I'm trying to create a query that fetches a single document based on the value of giata_properties.propertyCodes.provider.code.value.value and a specific providerType.
So for example, if my input is 071801 and restel, I want a query that will fetch the document I pasted above (because it contains these values).
I'm pretty new to N1QL, so here is what I tried so far (without the providerType input):
SELECT * FROM giata_properties AS gp
WHERE ANY `field` IN `gp.propertyCodes.provider.code.value` SATISFIES `field.value` = '071801' END;
This returns an empty result set. I'm probably doing all of this wrong.
edit1:
According to geraldss' answer, I was able to achieve my goal via 2 different queries.
1st (More general) ~2m50.9903732s
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp SATISFIES v.`value` = '071801' END;
2nd (More specific) ~2m31.3660388s
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp.propertyCodes.provider[*].code SATISFIES v.`value` = '071801' END;
The bucket has around 550K documents. Currently there are no indexes but the primary.
Question part 2
When I do either of the above queries, I get a result streamed to my shell very quickly, then I spend the rest of the query time waiting for the engine to finish iterating over all documents. I'm sure I'll only be getting 1 result from future queries, so I thought I could use LIMIT 1 to make the engine stop searching at the first result. I tried something like:
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp SATISFIES v.`value` = '071801' END LIMIT 1;
But that made no difference: I get a document written to my shell and then keep waiting until the query finishes completely. How can this be configured correctly?
edit2:
I've upgraded to the latest Enterprise 4.5.1-2844. I have only the primary index created on the giata_properties bucket; when I execute the query with the LIMIT 1 keyword it still takes the same time, it doesn't stop quicker.
I've also tried creating the array index you suggested, but the query is not using the index and keeps insisting on the #primary index (even if I use the USE INDEX clause).
I tried removing SELF from the index you suggested; it took a much longer time to build, and now the query can use this new index, but I'm honestly not sure what I'm doing here.
So 3 questions:
1) Why doesn't LIMIT 1 using the primary index make the query stop at the first result?
2) What's the difference between the index you suggested with and without SELF? I tried to look for SELF keyword documentation but I couldn't find anything.
This is how both indexes look in the web UI:
Index 1 (Your original suggestion) - Not working
CREATE INDEX `gp_idx1` ON `giata_properties`((distinct (array (`v`.`value`) for `v` within (array_star((((self.`giata_properties`).`propertyCodes`).`provider`)).`code`) end)))
Index 2 (Without SELF)
CREATE INDEX `gp_idx2` ON `giata_properties`((distinct (array (`v`.`value`) for `v` within (array_star(((self.`propertyCodes`).`provider`)).`code`) end)))
3) What would be the query for a specific giata_properties.propertyCodes.provider.code.value.value and a specific providerCode? I managed to do both separately but I wasn't successful in merging them.
Thanks for all your help!
Here is a query without the providerType.
EXPLAIN SELECT *
FROM giata_properties AS gp
WHERE ANY v WITHIN gp.giata_properties.propertyCodes.provider[*].code SATISFIES v.`value` = '071801' END;
You can also index this in Couchbase 4.5.0 and above.
CREATE INDEX idx1 ON giata_properties( DISTINCT ARRAY v.`value` FOR v WITHIN SELF.giata_properties.propertyCodes.provider[*].code END );
Edit to answer question edits
The performance has been addressed in 4.5.x. You should try the following on Couchbase 4.5.1 and post the execution times here:
1) Test on 4.5.1.
2) Create the index.
3) Use the LIMIT. In 4.5.1, the LIMIT is pushed down to the index.
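For question 3, combining the code value with a provider-level field such as providerCode, one possible sketch (not from the original answer) is a nested ANY, so both conditions apply to the same provider element:
SELECT *
FROM giata_properties AS gp
WHERE ANY p IN gp.giata_properties.propertyCodes.provider SATISFIES
      p.providerCode = 'restel'
      AND (ANY v WITHIN p.code SATISFIES v.`value` = '071801' END)
END;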