Couchbase N1QL query generally slow

I've been using Couchbase for quite some time, but I have never really experienced it to be fast. It is rather exceptionally slow, and I wonder what setting I am missing.
I have a root server with the following specs:
Intel® Xeon® E5-2680V4 (4 Cores)
12 GB DDR4 ECC
60 GB SSD
I'm running Couchbase 4.5.1-2844 Community Edition (build-2844) with 7.05 GB RAM allocated.
The bucket has 1 data node and uses 4.93 GB for 3,093,889 documents.
The bucket type is "Couchbase" with cache metadata set to "Value Ejection". Replicas are disabled. Disk I/O optimization is set to Low. Flushing is not enabled.
All 3 million documents look similar to this one:
{
  "discovered_by": 0,
  "color": "FFBA00",
  "updated_at": "2018-01-18T21:40:17.361Z",
  "replier": 0,
  "message": "Irgendwas los hier in Luckenwalde?🤔",
  "children": "",
  "view_count": 0,
  "post_own": "FFBA00",
  "user_handle": "oj",
  "vote_count": [
    {
      "timestamp": "2018-01-19 09:48:48",
      "votes": 0
    }
  ],
  "child_count": 3,
  "share_count": 0,
  "oj_replied": false,
  "location": {
    "loc_coordinates": {
      "lat": 0,
      "lng": 0
    },
    "loc_accuracy": 0,
    "country": "",
    "name": "Luckenwalde",
    "city": ""
  },
  "tags": [],
  "post_id": "59aef043f087270016dc5836",
  "got_thanks": false,
  "image_headers": "",
  "cities": [
    "Luckenwalde"
  ],
  "pin_count": 0,
  "distance": "friend",
  "image_approved": false,
  "created_at": "2017-09-05T18:43:15.904Z",
  "image_url": ""
}
And a query could look like this:
select COUNT(*) from sauger where color = 'FFBA00'
Without an index it fails to execute (it times out) via the Couchbase web console, but with an index
CREATE INDEX color ON sauger(color)
the result takes up to 16 seconds; after a few tries it takes 2 to 3 seconds each time.
There are 6 different color strings (like "FFBA00"), and the result of the query is 466,920 (about a sixth of the total documents).
An EXPLAIN of the above query gives me this:
[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "IndexCountScan",
          "covers": [
            "cover ((`sauger`.`color`))",
            "cover ((meta(`sauger`).`id`))"
          ],
          "index": "color",
          "index_id": "cc3524c6d5a8ef94",
          "keyspace": "sauger",
          "namespace": "default",
          "spans": [
            {
              "Range": {
                "High": [
                  "\"FFBA00\""
                ],
                "Inclusion": 3,
                "Low": [
                  "\"FFBA00\""
                ]
              }
            }
          ],
          "using": "gsi"
        },
        {
          "#operator": "IndexCountProject",
          "result_terms": [
            {
              "expr": "count(*)"
            }
          ]
        }
      ]
    },
    "text": "select COUNT(*) from sauger where color = 'FFBA00'"
  }
]
Everything is set up correctly, but such simple queries still take awfully long (and there is nothing else writing to or reading from the database, and the server it runs on is otherwise totally idle).

Make sure you don't have a primary index. That will consume a lot of the index service's memory. Your statement saying the query times out without the index makes me think there's a primary index, otherwise the query would fail immediately.
Edit: Adding more details on Primary Indexes from the Indexing Best Practices blog post
Avoid Primary Keys in Production
Unexpected full primary scans are a possibility, and any chance of such occurrences should be removed by avoiding primary indexes altogether in production. N1QL index selection is, for now, a rule-based system that checks for a possible index that will satisfy the query; if there is no such index, it resorts to using the primary index. The primary index has all the keys of the documents, so the query will fetch all keys from the primary index, then hop to the data service to fetch the documents and then apply the filters. As you can see, this is a very expensive operation and should be avoided at all costs.
If no primary index has been created and the query cannot find a matching index to serve it, the query service errors out with the following message. This is helpful and should guide you in creating the required secondary index:
“No index available on keyspace travel-sample that matches your query. Use CREATE INDEX or CREATE PRIMARY INDEX to create an index, or check that your expected index is online.”
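Concretely, against the bucket from the question, that advice amounts to something like the sketch below. The system:indexes check and the DROP PRIMARY INDEX statement are assumptions based on Couchbase defaults, not something the question confirms; only drop the primary index if the first query actually shows one.

SELECT * FROM system:indexes WHERE keyspace_id = 'sauger';

DROP PRIMARY INDEX ON sauger;

CREATE INDEX color ON sauger(color);
SELECT COUNT(*) FROM sauger WHERE color = 'FFBA00';

The last two statements are just the index and the query from the question; the point of the first two is to make sure the index service's memory is not being eaten by an unused primary index, which is what this answer suspects is slowing things down.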

Related

PostgreSQL jsonb_set function

I am using a PostgreSQL DB for persistence. One of my table's columns has the data type json, and the stored data format is like:
{
  "Terms": [
    {
      "no": 1,
      "name": "Vivek",
      "salary": 123
    },
    {
      "no": 2,
      "name": "Arjun",
      "salary": 123
    },
    {
      "no": 3,
      "name": "Ashok",
      "salary": 123
    }
  ]
}
I need to update only the no, name, or salary of the 1st Terms object.
I used native queries to load the data, and for better performance I should use a native query for the UPDATE as well. I tried the PostgreSQL jsonb_set function for the update, but I was unable to make it work.
I tried:
UPDATE table_name
SET terms = jsonb_set(terms->'Terms','{0,name}','"VVVV"',FALSE)
WHERE some condition
and the response message in the pgAdmin tool is:
Query returned successfully: 0 rows affected, x msec execution time.
Can anyone help me with this one?
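For reference, a hedged sketch of what the fix usually looks like: jsonb_set expects the whole column value as its first argument and the full path, including the Terms key, as its second, so passing terms->'Terms' would also overwrite the column with just the array. In addition, "0 rows affected" suggests the WHERE clause matched no rows at all. Assuming the column is named terms and is of type jsonb, the update would be:

UPDATE table_name
SET terms = jsonb_set(terms, '{Terms,0,name}', '"VVVV"', false)
WHERE some condition

If the column is declared json rather than jsonb, cast it on the way in and out: jsonb_set(terms::jsonb, ...)::json.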

Couchbase Lite live query returns all documents as changed even though only one document was changed

I am working on a Couchbase Lite driven app and am trying to do a live query based on this help from Couchbase Mobile.
While it works, I am confused by the number of documents that are reported as changed. This is all on my laptop, so I uploaded a JSON file to Couchbase Server via cbimport. Sync Gateway then synced all the data successfully to my Android app.
Now I changed one document in Couchbase Server, but all 27 documents are returned as changed in the live query. I was expecting only the document I changed to be returned as changed since the last sync.
Looking at the meta information of each document, the document I changed has the following:
{
  "meta": {
    "id": "Group_2404_159_5053",
    "rev": "15-16148876737400000000000002000006",
    "expiration": 0,
    "flags": 33554438,
    "type": "json"
  },
  "xattrs": {
    "_sync": {
      "rev": "7-ad618346393fa2490359555e9c889876",
      "sequence": 2951,
      "recent_sequences": [
        2910,
        2946,
        2947,
        2948,
        2949,
        2950,
        2951
      ],
      "history": {
        "revs": [
          "3-89bb125a9bb1f5e8108a6570ffb31821",
          "4-71480618242841447402418fa1831968",
          "5-4c4d990af34fa3f53237c3faafa85843",
          "1-4fbb4708f69d8a6cda4f9c38a1aa9570",
          "6-f43462023f82a12170f31aed879aecb2",
          "7-ad618346393fa2490359555e9c889876",
          "2-cf80ca212a3279e4fc01ef6ab6084bc9"
        ],
        "parents": [
          6,
          0,
          1,
          -1,
          2,
          4,
          3
        ],
        "channels": [
          null,
          null,
          null,
          null,
          null,
          null,
          null
        ]
      },
      "cas": "0x0000747376881416",
      "value_crc32c": "0x8c664755",
      "time_saved": "2020-06-01T14:23:30.669338-07:00"
    }
  }
}
while the remaining 26 documents are similar to this one:
{
  "meta": {
    "id": "Group_2404_159_5087",
    "rev": "2-161344efd90c00000000000002000006",
    "expiration": 0,
    "flags": 33554438,
    "type": "json"
  },
  "xattrs": {
    "_sync": {
      "rev": "1-577011ccb4ce61c69507ba44985ca038",
      "sequence": 2934,
      "recent_sequences": [
        2934
      ],
      "history": {
        "revs": [
          "1-577011ccb4ce61c69507ba44985ca038"
        ],
        "parents": [
          -1
        ],
        "channels": [
          null
        ]
      },
      "cas": "0x00000cd9ef441316",
      "value_crc32c": "0xc37bb792",
      "time_saved": "2020-05-28T11:34:50.3200745-07:00"
    }
  }
}
Is that the expected behavior, or is there something I can do about it?
That behavior is as expected. The live query re-runs the query every time there is a database change that impacts the results of the query. So in your case, since it's a query that fetches ALL documents in your database, the query re-runs when any document in the database changes, and it returns all documents (which is what the query is for).
Live queries are best suited to queries with a filter predicate. For instance, say the app wants to be notified when the status field in documents of type "foo" changes; in that case, you will only be notified when the status field changes in a document of type "foo".
In your case, if you just care about being notified when any document in your database changes, you should just use a Database Change Listener, as sketched below.
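As a rough illustration of both options, here is a sketch assuming the Couchbase Lite 2.x Android Java API; the database variable, the "type"/"status" fields and the "foo" filter are placeholders rather than anything from the question's data:

// imports from com.couchbase.lite.* and java.util.List assumed

// Option 1: live query with a filter predicate; the listener fires only when its result set changes
Query query = QueryBuilder
        .select(SelectResult.expression(Meta.id), SelectResult.property("status"))
        .from(DataSource.database(database))
        .where(Expression.property("type").equalTo(Expression.string("foo")));

ListenerToken queryToken = query.addChangeListener(change -> {
    for (Result row : change.getResults()) {
        // react to the documents matching the predicate
    }
});

// Option 2: database change listener; it fires on every change and reports only the changed document IDs
ListenerToken dbToken = database.addChangeListener(change -> {
    List<String> changedIds = change.getDocumentIDs();
    // react only to the documents that actually changed
});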

Postgres: Why do smaller JSON documents take more disk space than larger ones?

I am testing how much storage space JSON documents, stored as JSONB in Postgres 9.6.4, would take for our application. The tests produced odd results, and I'm hoping someone on SO might have an explanation.
Below is a sample of the JSON document our application stores. In the sample there are five sets of key/value pairs (valueA through valueE); however, the real data contains 10-30 sets of key/values.
During my testing I created three tables, and each one persisted 100,000 documents. Each table persisted a progressively larger JSON document with 10, 20 and 30 sets of key/values respectively.
What I cannot explain is that the table with the smallest JSON documents took MUCH more space than the tables with the larger JSON documents.
There are no repeating keys, and I've inspected the data in all tables: the entire JSON document is stored in all test scenarios. In every test I create a simple, brand-new table without any indexes and then load the JSON documents.
Here are my results:
100,000 JSON documents with:
10 sets of key/value pairs: took 195 MB of space
20 sets of key/value pairs: took 71 MB of space
30 sets of key/value pairs: took 87 MB of space
Does anyone know what would explain this?
Sample JSON Document:
{
  "values": {
    "valueA": {
      "created": "2017-08-29T22:22:13",
      "name": "targetA",
      "rank": 1.136,
      "valuetype": "expected"
    },
    "valueB": {
      "created": "2017-08-29T22:22:14",
      "name": "targetB",
      "rank": 0.067,
      "valuetype": "expected"
    },
    "valueC": {
      "created": "2017-08-29T22:22:15",
      "name": "targetC",
      "rank": 0.42,
      "valuetype": "expected"
    },
    "valueD": {
      "created": "2017-08-29T22:22:16",
      "name": "targetD",
      "rank": 0.986,
      "valuetype": "random"
    },
    "valueE": {
      "created": "2017-08-29T22:22:16",
      "name": "targetE",
      "rank": 0.111,
      "valuetype": "random"
    }
  }
}
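For anyone reproducing the comparison, a hedged sketch of how the per-table footprint can be measured; the table name docs_10 and the jsonb column name doc are placeholders, and pg_total_relation_size includes the TOAST storage where large jsonb values end up:

SELECT pg_size_pretty(pg_total_relation_size('docs_10'));

SELECT avg(pg_column_size(doc)) AS avg_stored_bytes
FROM docs_10;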

Cloudant Selector Query

I would like to query a Cloudant DB using a selector, for example as shown below: the user would like to find the loans borrowed whose amount exceeds a certain number. How do I access the array in a Cloudant selector to find a specific record?
{
  "_id": "65c5e4c917781f7365f4d814f6e1665f",
  "_rev": "2-73615006996721fef9507c2d1dacd184",
  "userprofile": {
    "name": "tom",
    "age": 30,
    "employer": "Microsoft"
  },
  "loansBorrowed": [
    {
      "loanamount": 5000,
      "loandate": "01/01/2001",
      "repaymentdate": "01/01/2001",
      "rateofinterest": 5.6,
      "activeStatus": true,
      "penalty": {
        "penalty-amount": 500,
        "reasonforPenalty": "Exceeded the date by 10 days"
      }
    },
    {
      "loanamount": 3000,
      "loandate": "01/01/2001",
      "repaymentdate": "01/01/2001",
      "rateofinterest": 5.6,
      "activeStatus": true,
      "penalty": {
        "penalty-amount": 400,
        "reasonforPenalty": "Exceeded the date by 10 days"
      }
    },
    {
      "loanamount": 2000,
      "loandate": "01/01/2001",
      "repaymentdate": "01/01/2001",
      "rateofinterest": 5.6,
      "activeStatus": true,
      "penalty": {
        "penalty-amount": 500,
        "reasonforPenalty": "Exceeded the date by 10 days"
      }
    }
  ]
}
If you use the default Cloudant Query index (type text, index everything):
{
  "index": {},
  "type": "text"
}
Then the following query selector should work to find e.g. all documents with a loanamount > 1000:
"loansBorrowed": { "$elemMatch": { "loanamount": { "$gt": 1000 } } }
I'm not sure that you can coax Cloudant Query into indexing only nested fields within an array, so if you don't need the flexibility of the "index everything" approach, you're probably better off creating a Cloudant Search index that indexes just the specific fields you need.
While Will's answer works, I wanted to let you know that you have other indexing options with Cloudant Query for handling arrays. This blog has the details on various tradeoffs (https://cloudant.com/blog/mango-json-vs-text-indexes/), but long story short, I think this might be the best indexing option for you:
{
  "index": {
    "fields": [
      {"name": "loansBorrowed.[].loanamount", "type": "number"}
    ]
  },
  "type": "text"
}
Unlike Will's index-everything approach, here you're only indexing a specific field, and if the field contains an array, you're also indexing every element in the array. Particularly for "type": "text" indexes on large datasets, specifying a field to index will save you index-build time and storage space. Note that text indexes that specify a field must use the form {"name": "fieldname", "type": "..."} in the "fields" array, where the type is one of boolean, number, or string.
So then the corresponding Cloudant Query "selector": statement would be this:
{
  "selector": {
    "loansBorrowed": {"$elemMatch": {"loanamount": {"$gt": 4000}}}
  },
  "fields": [
    "_id",
    "userprofile.name",
    "loansBorrowed"
  ]
}
Also note that you don't have to include "fields": as part of your "selector": statement, but I did here to only project certain parts of the JSON. If you omit it from your "selector": statement, the entire document will be returned.

Best database type and schema to search by attributes

I know that this question may not have an easy answer, or at least that there are many possible correct ones.
I am developing a weather web app to search cities by summary, temperature, humidity, precipitation, wind speed, visibility, pressure and some other weather indicators. I will also include the weather station set up there; to make things easier, let's consider it unique to every city. I would also like to include some city data such as population and afforestation index, as well as latitude and longitude.
Continent, Country and Region will also be needed.
The weather station will include the model number of every sensor installed in it.
There will be around 5,000 cities.
The most used query will be searching cities by temperature, humidity, precipitation, wind speed, visibility and pressure ranges, as well as filtering by population, etc., and by weather station sensor model name.
A query would look like:
summary = “Clear”
and temperature > 6 and temperature < 10
and pressure > 900 and pressure <1000
and visibility > 5 and visibility < 7
and humidity > 0.60 and humidity < 0.90
and population is > 20,000
and afforestation index is > 3
and country = France
and “sensor1” = “string”
The question is: what database type and schema best fit my search needs in terms of performance? As you can see, I need to search by attributes and not by the city name itself. I am completely free to use a relational or NoSQL database; beyond that, I would just like to use an asynchronous system.
I don't know if a NoSQL DB like MongoDB is meant to be used like this; if it is, would this schema be fast enough? I am worried because everything is nested and the indexes could be huge.
"continents":
[
{
"name": "Europe",
"countries":
[
{
"name": "France",
"regions":
[
{
"name": "Île-de-France"
"cities":
[
{
"name": "Paris",
"coordinates": {"lat": 48.856614, "lon": 2.352222},
"summary":"Clear",
"temperature": 9.4,
"pressure": 976,
"visibility" : 6.8,
"humidity" : 0.84,
"afforestation": 6,
"population": 2249975,
...
"weather_station": {
"name": "name",
"sensor 1": "string",
"sensor 2": "string",
"sensor 3": "string",
"sensor 4": "string",
}
},
...
]
},
...
]
},
...
]
},
...
]
I guess this use case has come up in many other apps that require searching by element attributes.
Oh! I forgot to say that I am using Python and the Tornado web framework.
Many thanks for your help!
The following schema may be what you are looking for.
Note that in document DBs you will need to denormalize your data slightly to match the way it is accessed most often.
This would be one document (row) in a City collection:
{
  "City": "Paris",
  "coordinates": {"lat": 48.856614, "lon": 2.352222},
  "summary": "Clear",
  "temperature": 9.4,
  "pressure": 976,
  "visibility": 6.8,
  "humidity": 0.84,
  "afforestation": 6,
  "population": 2249975,
  ...
  "weather_station": {
    "name": "name",
    "sensor 1": "string",
    "sensor 2": "string",
    "sensor 3": "string",
    "sensor 4": "string"
  },
  "region": "Île-de-France",
  "country": "France",
  "continent": "Europe"
}
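Under that flattened schema, the most-used query from the question becomes a single filter document; a rough sketch in MongoDB query syntax (the cities collection name is an assumption) looks like this:

db.cities.find({
    "summary": "Clear",
    "temperature": { "$gt": 6, "$lt": 10 },
    "pressure": { "$gt": 900, "$lt": 1000 },
    "visibility": { "$gt": 5, "$lt": 7 },
    "humidity": { "$gt": 0.60, "$lt": 0.90 },
    "population": { "$gt": 20000 },
    "afforestation": { "$gt": 3 },
    "country": "France",
    "weather_station.sensor 1": "string"
})

At roughly 5,000 documents a filter like this is cheap even without indexes; a compound index on the fields you filter by most often (for example country plus one of the range fields) narrows it further.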
5000 rows in one table? About 20 metrics? No "history"?
Make a single table with 5000 rows and 20 columns. No INDEXes other than the minimal PRIMARY KEY for UPDATEing a row when a weather station reports in. Build a SELECT from the desired conditions, then let the optimizer do a full table scan.
Everything will stay in RAM, and the SELECTs will be "brute force". It should take only a few milliseconds. (I ran a similar SELECT on a 2.7M-row table; it took 1.3 seconds.)
If you are keeping history, then we need to talk further.
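A minimal sketch of that single-table layout and the brute-force SELECT (column names follow the question; the exact types, lengths, and the city_id key are assumptions):

CREATE TABLE city_weather (
    city_id        INT PRIMARY KEY,   -- only used to UPDATE a row when its station reports in
    city           VARCHAR(100),
    country        VARCHAR(100),
    summary        VARCHAR(20),
    temperature    FLOAT,
    pressure       FLOAT,
    visibility     FLOAT,
    humidity       FLOAT,
    population     INT,
    afforestation  FLOAT,
    sensor1        VARCHAR(50)
    -- ...the remaining metrics and sensors as further plain columns
);

SELECT city
FROM city_weather
WHERE summary = 'Clear'
  AND temperature > 6 AND temperature < 10
  AND pressure > 900 AND pressure < 1000
  AND visibility > 5 AND visibility < 7
  AND humidity > 0.60 AND humidity < 0.90
  AND population > 20000
  AND afforestation > 3
  AND country = 'France'
  AND sensor1 = 'string';

With only 5,000 rows the whole table fits in RAM, so this full scan takes just a few milliseconds regardless of which combination of conditions is used.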