LEFT OUTER JOIN + WHERE clause in Couchbase - couchbase

I am trying to perform a LEFT OUTER JOIN while filtering on the right part of the join.
I have created the following index to achieve this:
CREATE INDEX `idx_store_order` ON `myBucket`(("Store::" || `storeId`)) WHERE ((`docType` = "Order") or (`docType` is missing))
and I am trying to execute the following query:
SELECT store.status, order.clientId, store.docId
FROM myBucket store
LEFT OUTER JOIN myBucket order ON KEY ("Store::" || order.storeId) FOR store
WHERE store.docType="Store"
AND (order.docType="Order" OR order.docType IS MISSING)
AND order.clientId="9281ae36-a418-4ea3-93f0-bfd7b1a38248"
I have 30 documents with docType="Store", but when I perform this query I don't get all 30 results. If I remove the last clause and group by store, then I do get the 30 results, so it's the last clause that affects the final result.
I have also tried the following statement (unsuccessfully) as the last clause:
AND (order.clientId="9281ae36-a418-4ea3-93f0-bfd7b1a38248" OR order.docType IS MISSING)
Am I missing something? Thanks
EDIT
Here's the explain query:
[
{
"plan": {
"#operator": "Sequence",
"~children": [
{
"#operator": "IndexScan",
"index": "idx_docType",
"index_id": "e498d0c0ee2f0d9d",
"keyspace": "myBucket",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"\"Store\""
],
"Inclusion": 3,
"Low": [
"\"Store\""
]
}
}
],
"using": "gsi"
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Fetch",
"as": "store",
"keyspace": "myBucket",
"namespace": "default"
},
{
"#operator": "IndexJoin",
"as": "order",
"for": "store",
"keyspace": "myBucket",
"namespace": "default",
"on_key": "(\"Store::\" || (`order`.`storeId`))",
"outer": true,
"scan": {
"index": "idx_store_order",
"index_id": "a97fce5158e6e573",
"using": "gsi"
}
},
{
"#operator": "Filter",
"condition": "((((`store`.`docType`) = \"Store\") and (((`order`.`docType`) = \"Order\") or ((`order`.`docType`) is missing))) and (((`order`.`clientId`) = \"9281ae36-a418-4ea3-93f0-bfd7b1a138248\") or (`order` is missing)))"
},
{
"#operator": "InitialProject",
"result_terms": [
{
"expr": "(`store`.`status`)"
}
]
},
{
"#operator": "FinalProject"
}
]
}
}
]
},
"text": "SELECT store.status\nFROM myBucket store\nLEFT OUTER JOIN myBucket order ON KEY (\"Store::\" || order.storeId) FOR store\nWHERE store.docType=\"Store\"\nAND (order.docType=\"Order\" OR order.docType IS MISSING)\nAND (order.clientId=\"9281ae36-a418-4ea3-93f0-bfd7b1a138248\" OR order IS MISSING)"
}
]
EDIT2
As discussed in the comments, I want to list all stores, regardless of whether a given customer has orders in them or not. If the customer does have orders, then I want to show certain fields along with the list of stores.
E.g.
Store 1 - Client X does not have orders
Store 2 - Client X has one order, and some information is shown alongside the store info

Outer joins produce all left-side documents irrespective of whether the join-key predicate matches (the join-key predicate only, not any condition in your WHERE clause). That means you get 30 results whether you have a matching order.storeId or not.
In this case, the last filter is on the client ID, which is applied post-JOIN and hence filters out some documents. Check/post the EXPLAIN output to validate.

In N1QL, the WHERE clause is currently not considered part of the JOIN predicate, so you have to do the following. Note that order is a reserved word, so you need to escape it throughout (as below) or use a different alias.
SELECT store.status, `order`.clientId, store.docId
FROM myBucket store
LEFT OUTER JOIN myBucket `order` ON KEY ("Store::" || `order`.storeId) FOR store
WHERE store.docType="Store"
AND (
(`order` IS MISSING)
OR
((`order`.docType="Order" OR `order`.docType IS MISSING)
AND `order`.clientId="9281ae36-a418-4ea3-93f0-bfd7b1a38248")
)
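If your Couchbase Server version supports ANSI joins (5.5 and later), an alternative worth trying is to fold the right-side filters into the ON clause itself, where they become part of the join predicate. A sketch only, not tested against your data:
SELECT store.status, o.clientId, store.docId
FROM myBucket store
LEFT OUTER JOIN myBucket o
ON META(store).id = ("Store::" || o.storeId)
AND (o.docType = "Order" OR o.docType IS MISSING)
AND o.clientId = "9281ae36-a418-4ea3-93f0-bfd7b1a38248"
WHERE store.docType = "Store";
Here the alias o sidesteps the reserved word, and non-matching stores are still returned with o missing.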

Related

Using Recursive feature while Flattening in Snowflake

I have a JSON string, which needs to be parsed in order to retrieve particular values. Here is an example I am working with:
{
"assignable_type": "SHIPMENT",
"rule": {
"rules": [
{
"meta_data": {},
"rules": [
{
"op": "IN",
"target": "CLIENT_FID",
"type": "ARRAY_VALUE_ASSERTION",
"values": [
"flx::core:client:dbid/64171",
"flx::core:client:dbid/76049",
"flx::core:client:dbid/34040",
"flx::core:client:dbid/61806"
]
}
],
"type": "AND"
}
],
"type": "OR"
},
"type": "USER_DEFINED"
}
The goal is to get the values where "target": "CLIENT_FID".
The expected output for this JSON file should be:
["flx::core:client:dbid/64171",
"flx::core:client:dbid/76049",
"flx::core:client:dbid/34040",
"flx::core:client:dbid/61806"]
Here, as we can see, rules is a list of dictionaries, and lists can be nested, as seen in the example.
Similarly, we have another JSON file of the following type:
{
"assignable_type": "SHIPMENT",
"rule": {
"rules": [
{
"meta_data": {},
"rules": [
{
"op": "IN",
"target": "PORT_OF_ENTRY_FID",
"type": "ARRAY_VALUE_ASSERTION",
"values": [
"flx::core:port:dbid/566788",
"flx::core:port:dbid/566931",
"flx::core:port:dbid/561482"
]
}
],
"type": "AND"
},
{
"meta_data": {},
"rules": [
{
"op": "IN",
"target": "PORT_OF_LOADING_FID",
"type": "ARRAY_VALUE_ASSERTION",
"values": [
"flx::core:port:dbid/561465"
]
},
{
"op": "IN",
"target": "SHIPMENT_MODE",
"type": "ARRAY_VALUE_ASSERTION",
"values": [
0
]
},
{
"op": "IN",
"target": "CLIENT_FID",
"type": "ARRAY_VALUE_ASSERTION",
"values": [
"flx::core:client:dbid/28169"
]
}
],
"type": "AND"
}
],
"type": "OR"
},
"type": "USER_DEFINED"
}
For the second example, the expected output should be:
["flx::core:client:dbid/28169"]
As seen, we may need to read the values at different depths in the file. To address this issue, I used the following code:
/* first convert the string to a JSON object in cte1 */
with cte1 as (
select to_json(json_string) as json_rep,
parse_json(json_extract_path_text(json_rep, 'rule.rules')) as list_elem
from table1),
cte2 as (select split_array,
json_extract_path_text(split_array, 'target') as target_client
from (
select json_rep,
list_elem,
t.value as split_array,
typeof(split_array) as obj_type,
index
from cte1,
table(flatten(cte1.list_elem, recursive=>true)) as t) temp /* use recursive feature */
where split_array ilike '%"target":"client_fid"%' /* filter for those rows containing this string */
and obj_type='OBJECT')
select
split_array,
json_extract_path_text(split_array, 'values') as client_values
from cte2
where target_client='CLIENT_FID'; /* filter the rows where we have the dictionary containing client fid */
To address the issue of the varying depth at which CLIENT_FID is found, we recurse while flattening the string into rows. The output obtained for both of the above inputs is provided below.
For the first string, we get the actual output in the column client_values as
["flx::core:client:dbid/64171",
"flx::core:client:dbid/76049",
"flx::core:client:dbid/34040",
"flx::core:client:dbid/61806"]
Similarly, for the second string we get the actual output as
["flx::core:client:dbid/28169"]
As seen, the code appears to work and produces the correct output, but the way I filter in the final query on target_client='CLIENT_FID' seems very hacky. Is there a better approach for retrieving the CLIENT_FID values even though the depth at which they appear can vary in the input?
Help is appreciated.
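For comparison, a more direct formulation of the same recursive-flatten idea would filter on the flattened element's own target attribute rather than on its string representation. A sketch, assuming the raw JSON sits in a string column json_string of a table table1:
select f.value:"values" as client_values
from table1,
lateral flatten(input => parse_json(json_string):rule:rules, recursive => true) f
where typeof(f.value) = 'OBJECT' /* keep only the rule objects */
and f.value:target::string = 'CLIENT_FID'; /* test the attribute itself */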

N1QL Query to join array fields with an array in another document

I have 3 documents types :
Data
{
"formId": "7508e7b2-bcf7-437b-a206-9fee87256d01",
"dataValues": [
{
"questionId": "Someguid123",
"questionValue": "Question1"
},
{
"questionId": "Someguid",
"questionValue": "Question2"
},
{
"questionId": "AnotherGuid",
"questionValue": "Question3"
}
],
"lastUpdateDateTime": "2023-01-04T10:56:49Z",
"type": "Data",
"templateId": "41e4cc2c-e9fb-4bdc-9dc2-af19e5988984",
"creationDateTime": "2022-12-28T11:20:46Z"
}
AttachedDocuments
{
"id": "AttachedDocuments::77961b70-2071-4410-837a-436c908a4fa5",
"lastUpdateDateTime": "2023-01-05T11:47:17Z",
"documents": [
{
"isUploaded": false,
"id": "DocumentMetadata::001",
"isDeleted": false,
"type": "photo",
"parentId": "Someguid123"
},
{
"isUploaded": false,
"id": "DocumentMetadata::002",
"isDeleted": false,
"type": "photo",
"parentId": "Someguid123"
}
],
"type": "AttachedDocuments",
"parentDocId": "MyFormData::7508e7b2-bcf7-437b-a206-9fee87256d01",
"creationDateTime": "2022-12-28T11:20:46Z"
}
DocumentMetaData
{
"id": "DocumentMetadata::001",
"type": "DocumentMetadata",
"name": "MyForm_001.png",
"documentId": "549c4da2-ad3a-4f92-bfa2-019750a11007",
"contentType": "FILE",
"parentDocumentId": "AttachedDocuments::77961b70-2071-4410-837a-436c908a4fa5",
"creationDateTime": "2023-01-04T10:56:49Z"
},
{
"id": "DocumentMetadata::002",
"type": "DocumentMetadata",
"name": "MyForm_002.png",
"documentId": "549c4da2-ad3a-4f92-bfa2-019750a11007",
"contentType": "FILE",
"parentDocumentId": "AttachedDocuments::77961b70-2071-4410-837a-436c908a4fa5",
"creationDateTime": "2023-01-04T10:56:49Z"
}
Every Data type document has exactly one AttachedDocuments document, whose parentDocId field is set to the formId field of the Data document.
If an item in Data.dataValues has a document attached to it, the AttachedDocuments.documents array has items with the parentId field set to Data.dataValues[i].questionId.
Also, every AttachedDocuments.documents[i] item has a DocumentMetadata document whose id matches the AttachedDocuments.documents[i].id field.
I want a query which returns all Data.dataValues as an array, where each item carries a links field containing the DocumentMetadata.name values when documents are attached, like below:
[
{
"questionId": "Someguid123",
"questionValue": "Question1",
"links": ["MyForm_001.png", "MyForm_002.png"]
},
{
"questionId": "Someguid",
"questionValue": "Question2"
},
{
"questionId": "AnotherGuid",
"questionValue": "Question3"
}
]
I tried the UNNEST clause but couldn't output the dataValues items that have no documents. How should I write the query to include those as well?
Thank you
Assuming you have a 1:1 relationship between Data & AttachedDocuments, you can try:
CREATE SCOPE default.f;
CREATE COLLECTION default.f.Data;
CREATE COLLECTION default.f.AttachedDocuments;
CREATE COLLECTION default.f.DocumentMetaData;
CREATE INDEX ix1 ON default.f.DocumentMetaData(id);
SELECT dataValues.questionId, dataValues.questionValue, links
FROM default.f.Data JOIN default.f.AttachedDocuments ON "MyFormData::"||Data.formId = AttachedDocuments.parentDocId
UNNEST Data.dataValues AS dataValues
LET links = (SELECT RAW DocumentMetaData.name
FROM default.f.DocumentMetaData
WHERE DocumentMetaData.parentDocumentId = AttachedDocuments.id
AND id IN ARRAY a.id FOR a IN AttachedDocuments.documents WHEN a.parentId = dataValues.questionId END
)
;
If you have a 1:n relationship between Data & AttachedDocuments but the attachments for a single question are wholly in a single attached document:
CREATE INDEX ix2 ON default.f.AttachedDocuments(parentDocId);
CREATE INDEX ix3 ON default.f.AttachedDocuments(id);
SELECT dataValues.questionId, dataValues.questionValue, links
FROM default.f.Data JOIN default.f.AttachedDocuments ON "MyFormData::"||Data.formId = AttachedDocuments.parentDocId
UNNEST Data.dataValues AS dataValues
LET links = (SELECT RAW md.name
FROM default.f.AttachedDocuments ad JOIN default.f.DocumentMetaData md ON ad.id = md.parentDocumentId
UNNEST ad.documents d
WHERE ad.parentDocId = "MyFormData::"||Data.formId
AND d.id = md.id
AND d.parentId = dataValues.questionId
)
WHERE ANY dv IN AttachedDocuments.documents SATISFIES dv.parentId = dataValues.questionId END
;
If attachments for a single question can be spread over multiple attached documents, add a DISTINCT to the above statement.
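That is, the links subquery in the second statement becomes:
LET links = (SELECT DISTINCT RAW md.name
FROM default.f.AttachedDocuments ad JOIN default.f.DocumentMetaData md ON ad.id = md.parentDocumentId
UNNEST ad.documents d
WHERE ad.parentDocId = "MyFormData::"||Data.formId
AND d.id = md.id
AND d.parentId = dataValues.questionId
)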
HTH.
(You can use the same logic without collections by adding appropriate aliasing and type field filtering.)

Will there be a performance overhead when using an index having Object_Pairs (in case of a covered query) - Couchbase

Suppose I create an index on OBJECT_PAIRS(values).val.data.
Will my index store the values field as an array (with elements name for the ID and val for the data, due to OBJECT_PAIRS)?
If so, and if my N1QL query is also a covered query (fetching only OBJECT_PAIRS(values).val.data via the SELECT clause), will there still be a performance overhead? (I am under the impression that in this case, as the index would already contain the values field as an array, no actual OBJECT_PAIRS transformation would take place, hence avoiding the overhead. Only in the case of a non-covered query would the actual document be accessed and the OBJECT_PAIRS transformation be done on the values field.)
Couchbase document:
"values": {
"item_1": {
"data": [{
"name": "data_1",
"value": "A"
},
{
"name": "data_2",
"value": "XYZ"
}
]
},
"item_2": {
"data": [{
"name": "data_1",
"value": "123"
},
{
"name": "data_2",
"value": "A23"
}
]
}
}
}
UPDATE:
Suppose we plan to create an index on OBJECT_PAIRS(values)[*].val.data and OBJECT_PAIRS(values)[*].name:
Index: CREATE INDEX idx01 ON ent_comms_tracking(ARRAY { value.name, value.val.data} FOR value IN object_pairs(values) END)
Query: SELECT ARRAY { value.name, value.val.data} FOR value IN object_pairs(values) END as values_array FROM bucket
Can you please paste your full create index statement?
Creating index on OBJECT_PAIRS(values).val.data indexes nothing.
You can check it out by creating a primary index and then running below query:
SELECT OBJECT_PAIRS(`values`).val FROM mybucket
Output is:
[
{}
]
OBJECT_PAIRS(values) returns an array of objects containing the attribute name and value pairs of the object values:
SELECT OBJECT_PAIRS(`values`) FROM mybucket
[
{
"$1": [
{
"name": "item_1",
"val": {
"data": [
{
"name": "data_1",
"value": "A"
},
{
"name": "data_2",
"value": "XYZ"
}
]
}
},
{
"name": "item_2",
"val": {
"data": [
{
"name": "data_1",
"value": "123"
},
{
"name": "data_2",
"value": "A23"
}
]
}
}
]
}
]
It's an array, so val cannot be referenced on it directly.
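To reach val you have to iterate over the array, e.g. with an ARRAY construct. A sketch against the document above (mybucket and the field names are taken from the earlier examples):
SELECT ARRAY v.val.data FOR v IN OBJECT_PAIRS(`values`) END AS all_data
FROM mybucket;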

Couchbase - SELECT a subset of fields from array of objects

I am using the travel-sample data set, and am running the following query:
SELECT id, schedule FROM `travel-sample` WHERE type = "route" LIMIT 1;
It is returning with the following results:
[
{
"id": 10000,
"schedule": [
{
"day": 0,
"flight": "AF198",
"utc": "10:13:00"
},
{
"day": 0,
"flight": "AF547",
"utc": "19:14:00"
},
...
]
}
]
However, I don't want to return the schedule.$.day field; i.e. I want my results to be:
[
{
"id": 10000,
"schedule": [
{
"flight": "AF198",
"utc": "10:13:00"
},
{
"flight": "AF547",
"utc": "19:14:00"
},
...
]
}
]
How can I SELECT only a subset of object fields from an array of objects?
I have tried UNNEST but I don't want to have a separate record for each schedule element - I want the schedule elements to remain nested inside the document.
I have also tried using OBJECT_REMOVE
SELECT id, ARRAY OBJECT_REMOVE(x, 'day') FOR x in schedule END AS schedule FROM `travel-sample` WHERE type = "route" LIMIT 1;
But I want to whitelist rather than blacklist fields.
Your last attempt was close. Instead of using OBJECT_REMOVE, you can simply construct the object you want returned.
SELECT id, ARRAY {"flight": x.flight, "utc": x.utc} FOR x in schedule END AS schedule FROM `travel-sample` WHERE type = "route" LIMIT 1;
You will get the following results:
[
{
"id": 10000,
"schedule": [
{
"flight": "AF198",
"utc": "10:13:00"
},
{
"flight": "AF547",
"utc": "19:14:00"
},
...
]
}
]

Couchbase DISTINCT very slow

I'm working through the free CB110 course on N1QL offered at learn.couchbase.com.
The following query from the course's accompanying workbook takes 1 minute:
SELECT DISTINCT address.countryCode
FROM couchmusic2
WHERE email LIKE "%hotmail.com";
I have a GSI on email.
The following query takes milliseconds:
SELECT COUNT(*)
FROM couchmusic2
WHERE email LIKE "%hotmail.com";
which leads me to believe that DISTINCT is the problem.
EXPLAIN reveals this:
[
{
"plan": {
"#operator": "Sequence",
"~children": [
{
"#operator": "IndexScan",
"index": "idx_email",
"index_id": "c2e612a0d697d8b6",
"keyspace": "couchmusic2",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"[]"
],
"Inclusion": 1,
"Low": [
"\"\""
]
}
}
],
"using": "gsi"
},
{
"#operator": "Fetch",
"keyspace": "couchmusic2",
"namespace": "default"
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Filter",
"condition": "((`couchmusic2`.`email`) like \"%hotmail.com\")"
},
{
"#operator": "InitialProject",
"distinct": true,
"result_terms": [
{
"expr": "((`couchmusic2`.`address`).`countryCode`)"
}
]
},
{
"#operator": "Distinct"
},
{
"#operator": "FinalProject"
}
]
}
},
{
"#operator": "Distinct"
}
]
},
"text": "\nSELECT DISTINCT address.countryCode \nFROM couchmusic2 \nWHERE email LIKE \"%hotmail.com\";"
}
]
Why is the query so slow? How do I speed this query up?
The COUNT query uses a covered index.
Try the following index for the DISTINCT query:
CREATE INDEX ix1 ON couchmusic2(email,address.countryCode);
LIKE with a leading % requires a complete index scan. Check this out: https://dzone.com/articles/a-couchbase-index-technique-for-like-predicates-wi
For pattern matching on all strings ENDING with hotmail.com, do the following:
CREATE INDEX ix ON couchmusic2(SUBSTR(email, -11, 11), address.countryCode);
Modify the LIKE predicate to: WHERE SUBSTR(email, -11, 11) = "hotmail.com";
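Putting the two together, the rewritten query would be:
SELECT DISTINCT address.countryCode
FROM couchmusic2
WHERE SUBSTR(email, -11, 11) = "hotmail.com";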
Obviously, this is suitable only for hotmail.com; for other domains you'll need another index.
Check out the TOKENS() function for a more flexible way to index this.
To get the distinct values (when you have a VERY large number of items compared to the number of distinct values), try the MIN() optimization along with it:
https://dzone.com/articles/count-amp-group-faster-using-n1ql
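A sketch of that MIN() walk, with $prev as a query parameter you advance client-side (start it below the smallest possible value and repeat until no row comes back):
SELECT MIN(address.countryCode) AS countryCode
FROM couchmusic2
WHERE SUBSTR(email, -11, 11) = "hotmail.com"
AND address.countryCode > $prev; /* each call returns the next distinct value straight from the index */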