Couchbase DISTINCT very slow - couchbase

I'm working through the free CB110 course on N1QL offered at learn.couchbase.com.
The following query from the course's accompanying workbook takes 1 minute:
SELECT DISTINCT address.countryCode
FROM couchmusic2
WHERE email LIKE "%hotmail.com";
I have a GSI on email.
The following query takes milliseconds:
SELECT COUNT(*)
FROM couchmusic2
WHERE email LIKE "%hotmail.com";
which leads me to believe that DISTINCT is the problem.
EXPLAIN reveals this:
[
{
"plan": {
"#operator": "Sequence",
"~children": [
{
"#operator": "IndexScan",
"index": "idx_email",
"index_id": "c2e612a0d697d8b6",
"keyspace": "couchmusic2",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"[]"
],
"Inclusion": 1,
"Low": [
"\"\""
]
}
}
],
"using": "gsi"
},
{
"#operator": "Fetch",
"keyspace": "couchmusic2",
"namespace": "default"
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Filter",
"condition": "((`couchmusic2`.`email`) like \"%hotmail.com\")"
},
{
"#operator": "InitialProject",
"distinct": true,
"result_terms": [
{
"expr": "((`couchmusic2`.`address`).`countryCode`)"
}
]
},
{
"#operator": "Distinct"
},
{
"#operator": "FinalProject"
}
]
}
},
{
"#operator": "Distinct"
}
]
},
"text": "\nSELECT DISTINCT address.countryCode \nFROM couchmusic2 \nWHERE email LIKE \"%hotmail.com\";"
}
]
Why is the query so slow? How do I speed this query up?

The COUNT query uses a covering index, so it never has to fetch the documents.
Try the following index for the DISTINCT query:
CREATE INDEX ix1 ON couchmusic2(email, address.countryCode);
A LIKE pattern with a leading % still requires a complete index scan. See https://dzone.com/articles/a-couchbase-index-technique-for-like-predicates-wi

To match all strings ENDING with hotmail.com, do the following:
CREATE INDEX ix ON couchmusic2(SUBSTR(email, -11, 11), address.countryCode);
Modify the LIKE predicate to: WHERE SUBSTR(email, -11, 11) = "hotmail.com";
Obviously, this index is suitable only for hotmail.com; other domains need their own index.
Check out the TOKENS() function for a more flexible way to index this.
To get the distinct values (when you have a VERY large number of items compared to the number of distinct values), try the MIN() optimization along with it:
https://dzone.com/articles/count-amp-group-faster-using-n1ql
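To see why a leading % forces a full scan while an index on the suffix does not, here is a minimal Python sketch. It models an index as a sorted list and uses bisect for the range scan. The sample emails and variable names are made up for illustration, and this is only a model of the idea, not how Couchbase GSI is implemented.

```python
import bisect

# Hypothetical sample data: (email, countryCode) pairs.
docs = [
    ("alice@hotmail.com", "US"),
    ("bob@gmail.com", "GB"),
    ("carol@hotmail.com", "US"),
    ("dave@hotmail.com", "DE"),
    ("erin@yahoo.com", "FR"),
]

SUFFIX = "hotmail.com"

# Index on email: sorted by the full string, so entries ending in
# "hotmail.com" are scattered and every entry has to be inspected.
email_index = sorted(docs)
full_scan = {cc for email, cc in email_index if email.endswith(SUFFIX)}
entries_read_full = len(email_index)

# Index on (last 11 characters of email, countryCode), analogous to
# SUBSTR(email, -11, 11): matching entries are now contiguous.
suffix_index = sorted((email[-len(SUFFIX):], cc) for email, cc in docs)
lo = bisect.bisect_left(suffix_index, (SUFFIX, ""))
hi = bisect.bisect_right(suffix_index, (SUFFIX, "\uffff"))
range_scan = {cc for _, cc in suffix_index[lo:hi]}
entries_read_range = hi - lo

print(sorted(full_scan), entries_read_full)
print(sorted(range_scan), entries_read_range)
```

Both approaches return the same set of country codes, but the second one only touches the entries whose key equals the suffix.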

Related

Elasticsearch - How to get the latest record in each group with filter?

I have a few records in Elasticsearch. I want to group the records by user_id and fetch the latest record for each user, but only if its event_type is 1.
If the latest record's event_type is not 1, that record should not be fetched. I did this with a MySQL query. Please let me know how I can do the same in Elasticsearch.
This is the MySQL query:
SELECT * FROM user_events
WHERE id IN( SELECT max(id) FROM `user_events` group by user_id ) AND event_type=1;
I need the same output from Elasticsearch aggregations.
Elasticsearch Query:
GET test_analytic_report/_search
{
"from": 0,
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"event_date": {
"gte": "2022-10-01",
"lte": "2023-02-06"
}
}
}
]
}
},
"sort": {
"event_date": {
"order": "desc"
}
},
"aggs": {
"group": {
"terms": {
"field": "user_id"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"_source": ["user_id", "event_date", "event_type"],
"sort": {
"user_id": "desc"
}
}
}
}
}
}
}
I have two users, with user_id 55 and 56. The above query also returns records with other event_type values, but I want only the latest record per user, and only when its event_type is 1. If a user's latest record does not have event_type=1, that user should not appear in the aggregations.
In the above table, the latest record for user_id 56 has event_type 2, so it should not appear in the aggregations.
I tried, but the query does not return the exact result I want.
Note: event_date is the current date and time. As the above image shows, I inserted the records manually, which is why the dates differ.
GET user_events/_search
{
"size": 1,
"query": {
"term": {
"event_type": 1
}
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
Explanation: This is an Elasticsearch API request in JSON format. It retrieves the latest event of type 1 (specified by "event_type": 1 in the query) from the "user_events" index, with a size of 1 (specified by "size": 1) and sorts the results in descending order by the "id" field (specified by "order": "desc" in the sort).
If your ES version supports it, you can do this with the field collapse feature. Here is an example query:
{
"_source": false,
"query": {
"bool": {
"filter": {
"term": {
"event_type": 1
}
}
}
},
"collapse": {
"field": "user_id",
"inner_hits": {
"name": "the_record",
"size": 1,
"sort": [
{
"id": "desc"
}
]
}
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
In the response, the document you want is in inner_hits under the name you gave it; in my example that is the_record. You can increase the size of the inner hits if you want more records in each group, and you can sort them.
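Note that this query filters on event_type first and collapses afterwards, so it returns each user's latest event_type=1 record even when that user has a newer record of another type. The Python sketch below illustrates the difference between the two orders of operation. The field names follow the question, but the rows themselves are made up for illustration.

```python
# Hypothetical rows; a higher id means a more recent event.
events = [
    {"id": 1, "user_id": 55, "event_type": 1},
    {"id": 2, "user_id": 55, "event_type": 2},
    {"id": 3, "user_id": 56, "event_type": 2},
    {"id": 4, "user_id": 56, "event_type": 1},
]

def latest_per_user(rows):
    """Return {user_id: row with the highest id}."""
    latest = {}
    for row in rows:
        current = latest.get(row["user_id"])
        if current is None or row["id"] > current["id"]:
            latest[row["user_id"]] = row
    return latest

# Filter first, then group (what a term filter followed by collapse does).
filter_then_group = latest_per_user(e for e in events if e["event_type"] == 1)

# Group first, then filter (what the SQL in the question does).
group_then_filter = {
    uid: row
    for uid, row in latest_per_user(events).items()
    if row["event_type"] == 1
}

print(sorted(filter_then_group))  # users returned by filter-then-group
print(sorted(group_then_filter))  # users returned by group-then-filter
```

In this made-up data, user 55 is returned by the first approach (with the older record, id 1) but not by the second, because that user's most recent record has event_type 2.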
TL;DR
There are many ways to go about it:
Sorting
Collapsing
Latest Transform
All of these solutions approximate what you could get with SQL.
My personal favourite is the transform.
Solution - transform jobs
Set up
We create 2 users, each with 2 events.
PUT 75324839/_bulk
{"create":{}}
{"user_id": 1, "type": 2, "date": "2015-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 1, "type": 1, "date": "2016-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 2, "type": 1, "date": "2015-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 2, "type": 2, "date": "2016-01-01T00:00:00.000Z"}
Transform job
This transform job runs against the index 75324839.
For each user_id, it finds the latest document based on the value of the date field.
The results are stored in latest_75324839.
PUT _transform/75324839
{
"source": {
"index": [
"75324839"
]
},
"latest": {
"unique_key": [
"user_id"
],
"sort": "date"
},
"dest": {
"index": "latest_75324839"
}
}
If you query latest_75324839, you will find:
{
"hits": [
{
"_index": "latest_75324839",
"_id": "AGvuZWuqqz7c5ytICzX5Z74AAAAAAAAA",
"_score": 1,
"_source": {
"date": "2017-01-01T00:00:00.000Z",
"user_id": 1,
"type": 1
}
},
{
"_index": "latest_75324839",
"_id": "AA3tqz9zEwuio1D73_EArycAAAAAAAAA",
"_score": 1,
"_source": {
"date": "2016-01-01T00:00:00.000Z",
"user_id": 2,
"type": 2
}
}
]
}
}
Get the final results
To get the number of users whose latest event has type=1, run a simple search query such as:
GET latest_75324839/_search
{
"query": {
"term": {
"type": {
"value": 1
}
}
},
"aggs": {
"number_of_user": {
"cardinality": {
"field": "user_id"
}
}
}
}
Side notes
This transform job runs in batch mode, which means it runs only once.
It is also possible to run it continuously, so that the destination index always holds the latest event for each user_id.
Here are some examples.
You are looking for an SQL HAVING clause, which would let you filter results after grouping. Sadly, there is no equivalent in Elasticsearch.
So it is not possible to:
sort, collapse and filter afterwards (even post_filter does not help here)
use a top_hits aggregation with custom sorting and then filter
use any map/reduce scripted aggregations, as they do not support sorting
work with subqueries
Put simply, Elasticsearch is not a database. Any sorting or relation to other documents should be based on scoring, and the score should be calculated independently for each document, distributed across shards.
But there is a tiny loophole, which might be the solution for your use case. It is based on a top_metrics aggregation followed by a bucket_selector to eliminate the unwanted event types:
GET test_analytic_report/_search
{
"size": 0,
"aggs": {
"by_id": {
"terms": {
"field": "user_id",
"size": 100
},
"aggs": {
"tm": {
"top_metrics": {
"metrics": {
"field": "event_type"
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
},
"event_type_filter": {
"bucket_selector": {
"buckets_path": {
"event_type": "tm.event_type"
},
"script": "params.event_type == 1"
}
}
}
}
}
}
If you require more fields from the source document, you can add them to the top_metrics.
It is sorted by id here, but you can also sort by event_date.

Will there be a performance overhead when using an index having Object_Pairs (in case of a covered query) - Couchbase

Suppose I create an index on OBJECT_PAIRS(values).val.data.
Will my index store the "values" field as an array (with elements name for the ID and val for the data, as produced by OBJECT_PAIRS)?
If so, and if my N1QL query is a covered query (fetching only OBJECT_PAIRS(values).val.data in the SELECT clause), will there still be a performance overhead? My impression is that in this case the index already contains the "values" field as an array, so no actual OBJECT_PAIRS transformation takes place and the overhead is avoided; only for a non-covered query would the actual document be accessed and the OBJECT_PAIRS transformation applied to the "values" field.
Couchbase document:
"values": {
"item_1": {
"data": [{
"name": "data_1",
"value": "A"
},
{
"name": "data_2",
"value": "XYZ"
}
]
},
"item_2": {
"data": [{
"name": "data_1",
"value": "123"
},
{
"name": "data_2",
"value": "A23"
}
]
}
}
UPDATE:
Suppose we plan to create an index on OBJECT_PAIRS(values)[*].val.data and OBJECT_PAIRS(values)[*].name:
Index: CREATE INDEX idx01 ON ent_comms_tracking(ARRAY { value.name, value.val.data} FOR value IN object_pairs(values) END)
Query: SELECT ARRAY { value.name, value.val.data} FOR value IN object_pairs(values) END as values_array FROM bucket
Can you please paste your full create index statement?
Creating an index on OBJECT_PAIRS(values).val.data indexes nothing.
You can verify this by creating a primary index and then running the query below:
SELECT OBJECT_PAIRS(`values`).val FROM mybucket
Output is:
[
{}
]
OBJECT_PAIRS(values) returns an array of objects, each containing the name and value of one attribute of the values object:
SELECT OBJECT_PAIRS(`values`) FROM mybucket
[
{
"$1": [
{
"name": "item_1",
"val": {
"data": [
{
"name": "data_1",
"value": "A"
},
{
"name": "data_2",
"value": "XYZ"
}
]
}
},
{
"name": "item_2",
"val": {
"data": [
{
"name": "data_1",
"value": "123"
},
{
"name": "data_2",
"value": "A23"
}
]
}
}
]
}
]
The result is an array, so val cannot be referenced directly on it.
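The same behaviour can be sketched in Python. The object_pairs helper below is a hypothetical stand-in that only mimics the output shape shown above (it sorts by attribute name to keep the result deterministic, which is an assumption of this sketch); the data is the "values" object from the question.

```python
# The "values" object from the question.
values = {
    "item_1": {"data": [{"name": "data_1", "value": "A"},
                        {"name": "data_2", "value": "XYZ"}]},
    "item_2": {"data": [{"name": "data_1", "value": "123"},
                        {"name": "data_2", "value": "A23"}]},
}

def object_pairs(obj):
    """Mimic OBJECT_PAIRS: one {"name", "val"} object per attribute."""
    return [{"name": key, "val": obj[key]} for key in sorted(obj)]

pairs = object_pairs(values)

# pairs is a list, so there is no single "val" attribute to read;
# this is the analogue of OBJECT_PAIRS(`values`).val yielding nothing.
has_val_attribute = isinstance(pairs, dict) and "val" in pairs

# Iterating over the elements (like ARRAY ... FOR v IN ... END) works.
per_item = [{"name": p["name"], "data": p["val"]["data"]} for p in pairs]

print(has_val_attribute)
print([item["name"] for item in per_item])
```

Reaching val.data therefore needs an expression that walks the array elements, which is what the ARRAY ... FOR ... END form in the question's update does.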

Couchbase 4.5 - Index is not covered when array is used in where clause

I have a Couchbase(4.5) bucket my-data. A minimal overview of the bucket is as follows.
Document structure
{
_class: "com.dom.Activity",
depId: 984,
dayIds: [17896, 17897, 17898],
startTime: 10,
endTime: 20
}
Index
I also have an index in the bucket as follows.
CREATE INDEX idx_dep_day ON `my-data`(depId, distinct array i for i in dayIds end, meta().id) WHERE _class = "com.dom.Activity" and startTime is not null and endTime is not null;
I need to fetch some document ids and I hope to use the index given above for the purpose. Also, I want the result to be covered by the index.
The problem is that the query is not covered when I use the dayIds field in the where clause.
The following are the queries I tried and their EXPLAIN output.
Query-1 (with the dayIds array in where clause)
select meta(b).id from `my-data` b use index (idx_dep_day) where _class = 'com.dom.Activity' and depId = 984 and any i in dayIds satisfies i = 17896 end and startTime is not null and meta().id > 'Activity-a65e7e616f21e4c6d7b7bccbfd154da1' and endTime is not null limit 80000
Explain-1
[
{
"plan": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Sequence",
"~children": [
{
"#operator": "DistinctScan",
"scan": {
"#operator": "IndexScan",
"index": "idx_dep_day",
"index_id": "53398c61c49ae09a",
"keyspace": "my-data",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"984",
"17896"
],
"Inclusion": 2,
"Low": [
"984",
"17896",
"\"Activity-a65e7e616f21e4c6d7b7bccbfd154da1\""
]
}
}
],
"using": "gsi"
}
},
{
"#operator": "Fetch",
"as": "b",
"keyspace": "my-data",
"namespace": "default"
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Filter",
"condition": "(((((((`b`.`_class`) = \"com.dom.Activity\") and ((`b`.`depId`) = 984)) and any `i` in (`b`.`dayIds`) satisfies (`i` = 17896) end) and ((`b`.`startTime`) is not null)) and (\"Activity-a65e7e616f21e4c6d7b7bccbfd154da1\" < (meta(`b`).`id`))) and ((`b`.`endTime`) is not null))"
},
{
"#operator": "InitialProject",
"result_terms": [
{
"expr": "(meta(`b`).`id`)"
}
]
},
{
"#operator": "FinalProject"
}
]
}
}
]
},
{
"#operator": "Limit",
"expr": "80000"
}
]
},
"text": "select meta(b).id from `my-data` b use index (`idx_dep_day`)where `_class`= 'com.dom.Activity' and depId = 984 and any i in dayIds satisfies i = 17896 end and startTime is not null and \nmeta().id > 'Activity-a65e7e616f21e4c6d7b7bccbfd154da1' and endTime is not null limit 80000"
}
]
But when I remove the array from the where clause, the query is covered:
Query-2 (without the dayIds array in where clause)
select meta(b).id from `my-data` b use index (idx_dep_day) where _class = 'com.dom.Activity' and depId = 984 and startTime is not null and meta().id > 'Activity-a65e7e616f21e4c6d7b7bccbfd154da1' and endTime is not null limit 80000
Explain-2
[
{
"plan": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Sequence",
"~children": [
{
"#operator": "DistinctScan",
"scan": {
"#operator": "IndexScan",
"covers": [
"cover ((`b`.`depId`))",
"cover ((distinct (array `i` for `i` in (`b`.`dayIds`) end)))",
"cover ((meta(`b`).`id`))",
"cover ((meta(`b`).`id`))"
],
"filter_covers": {
"cover (((`b`.`endTime`) is not null))": true,
"cover (((`b`.`startTime`) is not null))": true,
"cover ((`b`.`_class`))": "com.dom.Activity"
},
"index": "idx_dep_day",
"index_id": "53398c61c49ae09a",
"keyspace": "core-data-20190221",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"successor(984)"
],
"Inclusion": 1,
"Low": [
"984"
]
}
}
],
"using": "gsi"
}
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Filter",
"condition": "(((((cover ((`b`.`_class`)) = \"com.dom.Activity\") and (cover ((`b`.`depId`)) = 984)) and cover (((`b`.`startTime`) is not null))) and (\"Activity-a65e7e616f21e4c6d7b7bccbfd154da1\" < cover ((meta(`b`).`id`)))) and cover (((`b`.`endTime`) is not null)))"
},
{
"#operator": "InitialProject",
"result_terms": [
{
"expr": "cover (meta(`b`).`id`))"
}
]
},
{
"#operator": "FinalProject"
}
]
}
}
]
},
{
"#operator": "Limit",
"expr": "80000"
}
]
},
"text": "select meta(`b`).`id` from \n`my-data` b use index (`idx_dep_day`)where \n`_class`= 'com.dom.Activity' and depId = 984 and startTime is not null and meta().id > 'Activity-a65e7e616f21e4c6d7b7bccbfd154da1' and endTime is not null limit 80000"
}
]
Why can’t I get the index covering when I use the dayIds array in the where clause?
I finally solved the issue. It turns out that the array itself must also be added to the index as a scalar key for covering to work.
CREATE INDEX idx_dep_day ON `my-data`(depId, distinct array i for i in dayIds end, meta().id, dayIds) WHERE _class = "com.dom.Activity" and startTime is not null and endTime is not null;
Now it works, and the following is the result:
Query
explain select meta(b).id from `my-data` b use index (idx_dep_day) where _class = 'com.dom.Activity' and depId = 984 and any i in dayIds satisfies i = 17896 end and startTime is not null and meta().id > 'Activity-2' and endTime is not null limit 80000
Output
[
{
"plan":{
"#operator":"Sequence",
"~children":[
{
"#operator":"Sequence",
"~children":[
{
"#operator":"DistinctScan",
"scan":{
"#operator":"IndexScan",
"covers":[
"cover ((b.depId))",
"cover ((distinct (array i for i in (b.dayIds) end)))",
"cover ((meta(b).id))",
"cover ((b.dayIds))",
"cover ((meta(b).id))"
],
"filter_covers":{
"cover (((b.endTime) is not null))":true,
"cover (((b.startTime) is not null))":true,
"cover ((b._class))":"com.dom.Activity"
},
"index":"idx_dep_day",
"index_id":"cb0adb18bf0f081f",
"keyspace":"test",
"namespace":"default",
"spans":[
{
"Range":{
"High":[
"984",
"17896"
],
"Inclusion":2,
"Low":[
"984",
"17896",
"\"Activity-2\""
]
}
}
],
"using":"gsi"
}
},
{
"#operator":"Parallel",
"~child":{
"#operator":"Sequence",
"~children":[
{
"#operator":"Filter",
"condition":"((((((cover ((b._class)) = \"com.dom.Activity\") and (cover ((b.depId)) = 984)) and any i in cover ((b.dayIds)) satisfies (i = 17896) end) and cover (((b.startTime) is not null))) and (\"Activity-2\" < cover ((meta(b).id)))) and cover (((b.endTime) is not null)))"
},
{
"#operator":"InitialProject",
"result_terms":[
{
"expr":"cover ((meta(b).id))"
}
]
},
{
"#operator":"FinalProject"
}
]
}
}
]
},
{
"#operator":"Limit",
"expr":"80000"
}
]
},
"text":"select meta(b).id from\ntest b use index (idx_dep_day)where _class= ‘com.dom.Activity’ and depId = 984\nand any i in dayIds satisfies i = 17896 end and startTime is not null and\nmeta().id > ‘Activity-2’ and endTime is not null limit 80000"
}
]
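One way to picture the fix: a query can be covered only if every expression it references is available from the index. The Python sketch below is a toy model of that rule under this assumption, not the planner's actual algorithm. The string labels and the is_covered helper are illustrative names only.

```python
def is_covered(index_keys, filter_fields, query_refs):
    """Toy rule: every referenced expression must come from the index."""
    available = set(index_keys) | set(filter_fields)
    return set(query_refs) <= available

# Fields fixed by the index's WHERE clause.
partial_filter = {"_class", "startTime", "endTime"}

# Expressions referenced by Query-1, including the ANY ... IN dayIds predicate.
query_refs = {"depId", "dayIds", "meta().id", "_class", "startTime", "endTime"}

original_index = ["depId", "DISTINCT ARRAY i FOR i IN dayIds END", "meta().id"]
fixed_index = original_index + ["dayIds"]

print(is_covered(original_index, partial_filter, query_refs))
print(is_covered(fixed_index, partial_filter, query_refs))
```

In this model the array key alone does not make the whole dayIds field available, which matches the EXPLAIN output above: after the fix, the filter reads dayIds from cover ((b.dayIds)).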

ElasticSearch filtered query with operator AND and OR

I'm working on an existing app that interacts with an Elasticsearch server, and I'm seeing some weird responses, probably because I'm new to Elasticsearch.
I have the indexed item below:
"_id": "59773d268770541557000012",
"_score": 0.03282923,
"_source": {
"_id": "59773d268770541557000012",
"active": null,
"address": "dummy address",
"center_ids": [],
"consultation_site_ids": [],
"coordinates": null,
"created_at": "2017-07-25T14:44:22.270+02:00",
"death_declaration_form_step_id": "56ddb086f0e0103b44000000",
"end_of_pregnancy_form_step_id": "56c34e63f0e0105e65000000",
"fax": "06.95.40.58.84",
"form_step_ids": [
"55361b215342491667030000",
"5541f16252f131f6a125a375",
"55361ba05342491667040000",
"553610835342491667010000",
"55361d225342491667050000",
"5541f34a52f131f6a125a377"
],
"hospital_id": "57c004905c5393772c002a62",
"name": "test site d'encronologie",
"phone": "06.95.40.58.84",
"short_name": "test site d'encronologie d'endcronologie",
"sites_union_ids": [],
"state": "active",
"updated_at": "2017-07-25T14:44:22.270+02:00",
"url": "http://www.testurl.com",
"user_ids": [],
"warnings_threshold": null,
"_type": "Site
I am querying the server with this query:
"query":{
"filtered":{
"query":{
"bool":{
"should":[
{
"multi_match":{
"fields":[
"name^5",
"name.edge^1",
"name.full^0.3"
],
"query":"enc",
"type":"cross_fields"
}
},
{
"match":{
"name":{
"query":"enc",
"type":"phrase_prefix",
"operator":"or"
}
}
},
{
"match":{
"name":{
"query":"enc",
"type":"boolean",
"boost":5
}
}
}
]
}
},
"filter":{
"and":[
{
"term":{
"hospital_id":"57c004905c5393772c002a62"
}
},
{
"term":{
"state":"active"
}
}
]
}
}
}}
This returns nothing (no hits).
On the other hand, if I change the filter operator from "and" to "or", I receive my 1 hit.
I am talking about the "and" in the "filter" branch:
"filter":{
"and":[
I really don't understand why OR works but AND does not.
Then again, when I change my query term from "enc" to "zzz_enc" everywhere in the query{} of the first branch WHILE keeping the "or", I get zero matches, even though the filter conditions on hospital_id and state are true for my item.
Why does the filter operator behave like this?
Thank you in advance.

LEFT OUTER JOIN + WHERE clause in Couchbase

I am trying to perform a LEFT OUTER JOIN while filtering on the right part of the join.
I have created the following index to achieve this:
CREATE INDEX `idx_store_order` ON `myBucket`(("Store::" || `storeId`)) WHERE ((`docType` = "Order") or (`docType` is missing))
and I am trying to execute the following query:
SELECT store.status, order.clientId, store.docId
FROM myBucket store
LEFT OUTER JOIN myBucket order ON KEY ("Store::" || order.storeId) FOR store
WHERE store.docType="Store"
AND (order.docType="Order" OR order.docType IS MISSING)
AND order.clientId="9281ae36-a418-4ea3-93f0-bfd7b1a38248"
I have 30 documents with docType="Store", but when I run this query I don't get 30 results. If I remove the last clause and group by store, I do get the 30 results, so it's the last clause that affects the final results.
I have also tried the following statement (unsuccessfully) as the last clause:
AND (order.clientId="9281ae36-a418-4ea3-93f0-bfd7b1a38248" OR order.docType IS MISSING)
Am I missing something? Thanks
EDIT
Here's the explain query:
[
{
"plan": {
"#operator": "Sequence",
"~children": [
{
"#operator": "IndexScan",
"index": "idx_docType",
"index_id": "e498d0c0ee2f0d9d",
"keyspace": "myBucket",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"\"Store\""
],
"Inclusion": 3,
"Low": [
"\"Store\""
]
}
}
],
"using": "gsi"
},
{
"#operator": "Parallel",
"~child": {
"#operator": "Sequence",
"~children": [
{
"#operator": "Fetch",
"as": "store",
"keyspace": "myBucket",
"namespace": "default"
},
{
"#operator": "IndexJoin",
"as": "order",
"for": "store",
"keyspace": "myBucket",
"namespace": "default",
"on_key": "(\"Store::\" || (`order`.`storeId`))",
"outer": true,
"scan": {
"index": "idx_store_order",
"index_id": "a97fce5158e6e573",
"using": "gsi"
}
},
{
"#operator": "Filter",
"condition": "((((`store`.`docType`) = \"Store\") and (((`order`.`docType`) = \"Order\") or ((`order`.`docType`) is missing))) and (((`order`.`clientId`) = \"9281ae36-a418-4ea3-93f0-bfd7b1a138248\") or (`order` is missing)))"
},
{
"#operator": "InitialProject",
"result_terms": [
{
"expr": "(`store`.`status`)"
}
]
},
{
"#operator": "FinalProject"
}
]
}
}
]
},
"text": "SELECT store.status\nFROM myBucket store\nLEFT OUTER JOIN myBucket order ON KEY (\"Store::\" || order.storeId) FOR store\nWHERE store.docType=\"Store\"\nAND (order.docType=\"Order\" OR order.docType IS MISSING)\nAND (order.clientId=\"9281ae36-a418-4ea3-93f0-bfd7b1a138248\" OR order IS MISSING)"
}
]
EDIT2
As discussed in the comments, I want to list all stores, regardless of a given customer having orders in it or not. If the customer does have orders, then I want to show certain fields along with the list of stores.
E.g.
Store 1 - Client X does not have orders
Store 2 - Client X does have one order, and some information is shown along the store info
Outer joins produce all left-side documents regardless of whether the join-key predicate matches (the join-key predicate only, not any condition in your WHERE clause). That means the join itself yields all 30 stores whether or not there is a matching order.storeId.
In this case, the last filter is on the client ID. It is applied after the JOIN, and hence filters out some documents. Check/post the EXPLAIN output to validate.
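The effect of a filter applied after the join can be sketched in Python. The documents, client IDs and the left_outer_join helper below are made up for illustration, and None stands in for a MISSING right-hand side.

```python
# Hypothetical documents; None stands in for a MISSING right-hand side.
stores = [{"docId": "Store::1"}, {"docId": "Store::2"}, {"docId": "Store::3"}]
orders = [
    {"storeId": "2", "clientId": "client-x"},
    {"storeId": "3", "clientId": "client-y"},
]

def left_outer_join(stores, orders):
    """Yield (store, order) pairs; order is None when nothing matches."""
    for store in stores:
        matches = [o for o in orders
                   if "Store::" + o["storeId"] == store["docId"]]
        for order in matches or [None]:
            yield store, order

joined = list(left_outer_join(stores, orders))

# WHERE order.clientId = "client-x": rows without an order fail the test.
strict = [s["docId"] for s, o in joined
          if o is not None and o["clientId"] == "client-x"]

# WHERE order IS MISSING OR order.clientId = "client-x".
lenient = [s["docId"] for s, o in joined
           if o is None or o["clientId"] == "client-x"]

print(strict)
print(lenient)
```

The join produces a row for every store, but the strict filter keeps only Store::2. Allowing a missing order keeps Store::1 as well. In this sketch Store::3, whose only order belongs to another client, is dropped by both predicates because its joined row has an order that fails the client test.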
In N1QL currently, the WHERE clause is not considered part of the JOIN predicate, so you have to do the following. You need to escape order throughout (it is a reserved word), or use a different alias.
SELECT store.status, order.userId, store.docId
FROM myBucket store
LEFT OUTER JOIN myBucket order ON KEY ("Store::" || order.storeId) FOR store
WHERE store.docType="Store"
AND (
(order IS MISSING)
OR
((order.docType="Order" OR order.docType IS MISSING)
AND order.clientId="9281ae36-a418-4ea3-93f0-bfd7b1a38248")