N1QL query spanning multiple buckets - couchbase

Two Couchbase-related questions...
I am using Java SDK 2.2.0 on Couchbase 4.0 RC0 to run some N1QL queries that join more than one bucket. In the Java SDK, query is a functionality exposed by the bucket interface. So, if I want to run an N1QL query joining more than one bucket, which bucket should I get a handle for (i.e. which bucket name should I pass when invoking Cluster.openBucket(...))? Operations like insert, upsert, delete etc. being tied to a bucket makes sense because they work on a document in a bucket, but shouldn't a query be more generic?
Do CouchbaseCluster.create() and Cluster.disconnect() create the necessary connections to the cluster? If so, what do opening and closing a bucket do?

It is true that N1QL is a bit less tied to the Bucket than the rest of the operations in the API, but we went with adding the query method there because we figured most people already using the SDK would be accustomed to dealing with the Bucket, and probably a lot of N1QL use cases will only span a single bucket.
However, to answer your question: it doesn't matter which Bucket reference you use; either will work.
Cluster.create() will compile the list of seed nodes to bootstrap from and prepare the ConfigurationManager so that the SDK can receive updates from the cluster. The actual connection, authentication dance and establishment of main resources are done when calling openBucket.
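To make both points concrete, here is a minimal sketch with the Java SDK 2.x API from the question. The bucket names (orders, users) and the userId field are made up for illustration; the lifecycle calls and the lookup-join syntax are the standard ones.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;
import com.couchbase.client.java.query.N1qlQueryRow;

public class CrossBucketQuery {
    public static void main(String[] args) {
        // create() only prepares seed nodes and config handling; no connection yet
        Cluster cluster = CouchbaseCluster.create("127.0.0.1");

        // openBucket() does the actual connect/authenticate work.
        // For a query spanning two buckets, either handle works.
        Bucket orders = cluster.openBucket("orders");

        // Lookup join across two buckets (names/fields are hypothetical)
        N1qlQueryResult result = orders.query(N1qlQuery.simple(
            "SELECT o.total, u.name " +
            "FROM orders o JOIN users u ON KEYS o.userId"));

        for (N1qlQueryRow row : result) {
            System.out.println(row);
        }

        // disconnect() closes any open buckets and releases resources
        cluster.disconnect();
    }
}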

Related

How to synchronize MySQL database with Amazon OpenSearch service

I am new to the Amazon OpenSearch service, and I wish to know if there's any way I can sync a MySQL database with OpenSearch in real time. I thought of Logstash, but it seems like it doesn't support delete and update operations, which might leave my OpenSearch cluster out of sync.
I'm going to comment for Elasticsearch as that is the tag used for this question.
You can:
Read from the database (SELECT * FROM TABLE)
Convert each record to a JSON document
Send the JSON document to Elasticsearch, preferably using the _bulk API.
Logstash can help with that. But I'd recommend modifying the application layer if possible and sending data to Elasticsearch in the same "transaction" as you are sending your data to the database.
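As a rough sketch of that read-convert-bulk-index loop in Java (the database URL, credentials, products table, column names and index name are all assumptions, and a real JSON library should be used instead of string concatenation):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MysqlToElasticsearch {
    public static void main(String[] args) throws Exception {
        // Assumes the MySQL JDBC driver is on the classpath
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "pass");
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name, price FROM products")) {

            // Build NDJSON for the _bulk API: one action line + one source line per row
            StringBuilder bulk = new StringBuilder();
            while (rs.next()) {
                bulk.append("{\"index\":{\"_index\":\"products\",\"_id\":\"")
                    .append(rs.getLong("id")).append("\"}}\n");
                bulk.append("{\"name\":\"").append(rs.getString("name"))
                    .append("\",\"price\":").append(rs.getBigDecimal("price"))
                    .append("}\n");
            }

            // POST the whole batch in one request
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_bulk"))
                .header("Content-Type", "application/x-ndjson")
                .POST(HttpRequest.BodyPublishers.ofString(bulk.toString()))
                .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }
}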
I shared most of my thoughts there: http://david.pilato.fr/blog/2015/05/09/advanced-search-for-your-legacy-application/
Also have a look at this "live coding" recording.
Side note: if you want to run Elasticsearch, have a look at Cloud by Elastic, also available if needed from the AWS Marketplace, Azure Marketplace and Google Cloud Marketplace.
Cloud by Elastic is one way to have access to all features, all managed by us. Think about what is already there, like Security, Monitoring, Reporting, SQL, Canvas, Maps UI, Alerting and the built-in solutions named Observability, Security, Enterprise Search, and what is coming next :) ...
Disclaimer: I'm currently working at Elastic.
Keep a column that indicates when the row was last modified; then you will be able to push updates to OpenSearch. Similarly for deletes: have a column indicating whether a row is deleted (a soft delete) and the date it was deleted.
With this DB design, you can send the "delete" or "update" actions to OpenSearch/Elasticsearch to update or delete the indexed documents based on the last-modified/deleted dates. You can later have a scheduled maintenance job permanently delete these rows from the database table.
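A sketch of the polling pass that design enables (table, column and index names are assumptions); each run picks up rows changed since the last sync and turns them into _bulk index or delete actions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class IncrementalSync {
    // Fetch rows changed since the last successful sync and emit
    // the matching _bulk action line(s) for each one.
    static void syncSince(Connection db, Timestamp lastSync, StringBuilder bulk) throws Exception {
        String sql = "SELECT id, name, price, is_deleted FROM products " +
                     "WHERE last_modified > ? OR (is_deleted = 1 AND deleted_at > ?)";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSync);
            ps.setTimestamp(2, lastSync);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    if (rs.getBoolean("is_deleted")) {
                        // Soft-deleted row -> delete the document from the index
                        bulk.append("{\"delete\":{\"_index\":\"products\",\"_id\":\"")
                            .append(id).append("\"}}\n");
                    } else {
                        // New or updated row -> (re)index the document
                        bulk.append("{\"index\":{\"_index\":\"products\",\"_id\":\"")
                            .append(id).append("\"}}\n")
                            .append("{\"name\":\"").append(rs.getString("name"))
                            .append("\",\"price\":").append(rs.getBigDecimal("price"))
                            .append("}\n");
                    }
                }
            }
        }
    }
}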
Lastly, this article might be of help to you: How to keep Elasticsearch synchronized with a relational database using Logstash and JDBC

Using Couchbase SDK vs Sync Gateway API

I have a full deployment of Couchbase (Server, Sync Gateway and Lite) and have an API, a mobile app and a web app all using it.
It works very well, but I was wondering if there are any advantages to using the Sync Gateway API over the Couchbase SDK. Specifically, I would like to know whether Sync Gateway would handle larger numbers of operations better than the SDK, perhaps via an internal queue/cache system, but I can't seem to find definitive documentation on this.
At the moment the API uses the C# Couchbase SDK, and we use Sync Gateway very little (only really for synchronising the mobile app).
First, some relevant background info:
Every document that needs to be synced over to Couchbase Lite (CBL) clients needs to be processed by the Sync Gateway (SGW). This is true whether a doc is written via the SGW API or whether it comes in via a server write (N1QL or SDK). The latter case is referred to as "import processing", wherein the document that is written to the bucket (via N1QL or the SDK) is read by SGW over the DCP feed. The document is then processed by SGW and written back to the bucket with the relevant sync metadata.
Prerequisite:
In order for SGW to import documents written directly via N1QL/SDK, you must enable "shared bucket access" and import processing, as discussed here
Non-mobile documents:
If you have documents that are never going to be synced to CBL clients, then the choice is obvious: use the server SDKs or N1QL.
Mobile documents (docs to sync to CBL clients):
Assuming you are on SGW 2.x syncing with CBL 2.x clients: if you have documents written at the server end that need to be synced to CBL clients, then consider the following.
Server-side write rate:
If you are looking at writes on the server side coming in at sustained rates significantly exceeding 1.5K/sec (let's say 5K/sec), then you should go the SGW API route. While it's easy enough to do a bulk update via a server N1QL query, remember that SGW still needs to keep up and do the import processing (discussed in the background above).
This means that if you are doing high-volume updates through the SDK/N1QL, you will have to rate-limit them so that SGW can keep up (do batched updates via the SDK).
That said, it is important to consider that if SGW can't keep up with the write throughput on the DCP feed, it's going to result in latency, no matter how the writes are happening (SGW API or N1QL).
If your sustained write rate on the server isn't expected to be significantly high, then go with N1QL.
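To make the rate-limiting point concrete, here is one naive way to batch server-side writes so SGW's import processing can keep up, using the Java SDK 2.x. The batch size and pause are illustrative knobs to tune against your SGW throughput, not values from the answer above.

import java.util.List;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;

public class RateLimitedWriter {
    // Upsert documents in fixed-size batches, pausing between batches
    // so the sustained rate stays below what SGW import can absorb.
    static void writeBatched(Bucket bucket, List<JsonDocument> docs,
                             int batchSize, long pauseMillis) throws InterruptedException {
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            for (JsonDocument doc : docs.subList(i, end)) {
                bucket.upsert(doc);
            }
            Thread.sleep(pauseMillis); // crude throttle between batches
        }
    }
}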
Deletes Handling:
Does not matter. Under shared-bucket-access, deletes coming in via SDK or SGW API will result in a tombstone. Read more about it here
SGW-specific config:
Naturally, if you are dealing with SGW-specific configuration, such as creating SGW users and roles, then you will use the SGW API for that.
Conflict Handling:
In 2.x, it does not matter. Conflicts are handled on CBL side.
Challenge with SGW API
Probably the biggest challenge in a real-world scenario is that using the SGW API path means either storing information about SGW revision IDs in the external system, or performing every mutation as a read-then-write (since there is no way to PUT a document without providing a revision ID).
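A sketch of that read-then-write round trip against the Sync Gateway REST API (the host, database name, document ID and body are assumptions): first GET the document to learn its current _rev, then PUT the mutation with that rev.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SgwReadThenWrite {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String base = "http://localhost:4984/mydb";

        // 1) GET the document to discover its current revision ID
        HttpResponse<String> get = http.send(
            HttpRequest.newBuilder(URI.create(base + "/user::42")).GET().build(),
            HttpResponse.BodyHandlers.ofString());
        // Crude extraction of "_rev" from the body; use a real JSON parser in practice
        String rev = get.body().replaceAll(
            "(?s).*\"_rev\"\\s*:\\s*\"([^\"]+)\".*", "$1");

        // 2) PUT the new version, supplying the rev we just read
        String updated = "{\"type\":\"user\",\"name\":\"new name\"}";
        HttpResponse<String> put = http.send(
            HttpRequest.newBuilder(URI.create(base + "/user::42?rev=" + rev))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(updated))
                .build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println(put.statusCode() + " " + put.body());
    }
}

The window between the GET and the PUT is exactly where a concurrent writer can sneak in, which is why this pattern is awkward compared to a plain SDK upsert.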
The short answer is that for backend operations the Couchbase SDK is your choice, and it will perform much better. Sync Gateway is meant to be used by mobile clients, with a few exceptions (*).
Bulk/Batch operations
In my performance tests using the Java Couchbase SDK and bulk operations from AsyncBucket (link), I updated up to 8 thousand documents per second. In .NET you can do batch operations too (link).
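The AsyncBucket bulk pattern referred to above looks roughly like this in the Java SDK 2.x, using the Rx from/flatMap idiom; the document list is assumed to be prepared by the caller.

import java.util.List;
import com.couchbase.client.java.AsyncBucket;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import rx.Observable;

public class BulkUpsert {
    // Fire all upserts concurrently on the async API and block until done
    static void bulkUpsert(Bucket bucket, List<JsonDocument> docs) {
        AsyncBucket async = bucket.async();
        Observable
            .from(docs)
            .flatMap(doc -> async.upsert(doc))
            .toList()
            .toBlocking()
            .single();
    }
}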
Sync Gateway also supports bulk operations, yet it is much slower because it relies on the REST API and it requires you to provide a _rev from the previous version of each document you want to update. This will usually result in the backend having to do a GET before doing a PUT. Also, keep in mind that Sync Gateway is not a storage unit. It just works as a proxy to Couchbase, managing mobile client access to segments of data based on the channels registered for each user, and it writes all of its metadata documents into the Couchbase Server bucket, including channel indexing, the user registry, document revisions and views.
Querying
Views are indexed, thus for queries over large data sets they will respond very fast. Whenever a document is changed, the map function of every view has the opportunity to map it. But when a view is created through the Sync Gateway REST API, some code is added to your map function to handle user channels/permissions, making it slower than plain code created directly in the Couchbase Admin UI. Querying views with compound keys using the startKey/endKey parameters is very powerful when you have hierarchical data, but this functionality and the use of the reduce function are not available to mobile clients.
N1QL can also be very fast, when your N1QL query takes advantage of Couchbase indexes.
Notes
(*) One exception to the rule is when you want to delete a document and have this reflected on mobile phones. The DELETE operation leaves an empty document with a _deleted: true attribute and can only be done through Sync Gateway. The next time the mobile device synchronizes and finds this hint, it will delete the document from local storage. You can also set this attribute through a PUT operation, where you may additionally add an _exp: "2019-12-12T00:00:00.000Z" attribute to perform a programmed purge of the document at a future date, so that the server also gets cleaned up. However, just purging a document through Sync Gateway is equivalent to deleting it through the Couchbase SDK, and this won't be reflected on mobile devices.
NOTE: Prior to Sync Gateway 1.5 and Couchbase 5.0, all backend operations had to be done directly through Sync Gateway so that Sync Gateway and mobile clients could detect those changes. This changed when the shared_bucket_access option was introduced. More info here.

Is there such a thing as a couchbase basket?

I have only heard of the Couchbase bucket; is there also a basket? I would like to have multiple apps use multiple buckets, but for Couchbase performance reasons, is there something lighter than a bucket, called a basket?
Never heard of a basket in Couchbase. That being said, we strongly encourage people to add a type field to every document stored in buckets. Before we had queries, we would tell you to support multiple applications by prefixing all your document keys with an app prefix. Now that we have N1QL and you can run queries based on the content, you should add a field in the doc as well.
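A sketch of both conventions with the Java SDK 2.x: an app prefix in the key plus type and app fields in the body (all names here are made up for illustration).

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;
import com.couchbase.client.java.query.N1qlQuery;

public class TypedDocs {
    static void storeUser(Bucket bucket, String appName, String userId, String name) {
        JsonObject content = JsonObject.create()
            .put("type", "user")   // document type, for N1QL filtering
            .put("app", appName)   // owning application
            .put("name", name);
        // Key prefixed by app, the older convention mentioned above
        bucket.upsert(JsonDocument.create(appName + "::user::" + userId, content));
    }

    static void queryUsers(Bucket bucket, String appName) {
        // Filter by content instead of (or in addition to) the key prefix
        bucket.query(N1qlQuery.simple(
            "SELECT b.* FROM `" + bucket.name() + "` b " +
            "WHERE b.type = 'user' AND b.app = '" + appName + "'"));
    }
}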
From a security perspective, you'll be mixing stuff from different apps in the same bucket. We have no way right now to make any distinction on the server side between docs from one app or the other. It means your security model has to be handled at the client/application layer.

Why does the Couchbase Server API require a name for new documents

When you create a document using the Couchbase Server API, one of the arguments is a document name. What is this used for and why is it needed?
When using Couchbase Lite you can create an empty document and it is assigned an _id and _rev. You do not need to give it a name. So what is this argument for in Couchbase Server?
In Couchbase Server it is a design decision that all objects are identified by the object ID, key or name (all the same thing by different names), and those are not auto-assigned. The reason for this is that keys are not embedded in the document itself, key lookups are the fastest way to get an object, and the technology dictates this under the hood of the server. Getting a document by ID is much faster than querying for it. Querying means you are asking a question, whereas getting the object by ID means you already know the answer and are just telling the DB to go get it for you, and it is therefore faster.
If the ID is something random, then more than likely you must query the DB, and that is less efficient. Couchbase Mobile's sync_gateway together with Couchbase Lite handles this on your behalf if you want it to, as it can have its own keyspace and key pattern that it manages for key lookups. If you are going straight to the DB on your own with the Couchbase SDK, though, knowing the key will be the fastest way to get that object. Like I said, Couchbase sync_gateway handles this lookup for you, as it is the app server. When you go direct with the SDKs you get more control, and different design patterns emerge.
Many people in Couchbase Server create a key pattern that means something to their application. As an example, for a user profile store I might consider breaking the profile up into three separate documents, with a unique username (in this example hernandez94) in each key:
1) login-data::hernandez94 is the object that has the encrypted password, since I need that all of the time and want it in Couchbase's managed cache for performance reasons.
2) sec-questions::hernandez94 is the object that has the user's 3 security questions; since I do not use that very often, I do not care if it is in the managed cache.
3) main::hernandez94 is the user's main document that has everything else that I might need often, but not nearly as often as the other documents.
This way I have tailored my keyspace naming to my application's access patterns and therefore get only the data I need, exactly when I need it, for best performance. Since these key names are standardized in my app, I could do a parallelized bulk get on all three of these documents, as my app can construct the names, and it would be VERY fast. Again, I am not querying for the data; I have the keys, so I just go get them. I could normalize this keyspace naming further depending on the access patterns of my application: email-addresses::hernandez94, phones::hernandez94, appl-settings::hernandez94, etc.
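That parallelized bulk get might look like this with the Java SDK 2.x async API (the key pattern is the one from the answer; error handling is omitted, and keys that don't exist are simply skipped):

import java.util.Arrays;
import java.util.List;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import rx.Observable;

public class ProfileFetch {
    // Construct the three well-known keys and fetch them in parallel
    static List<JsonDocument> loadProfile(Bucket bucket, String username) {
        List<String> keys = Arrays.asList(
            "login-data::" + username,
            "sec-questions::" + username,
            "main::" + username);
        return Observable
            .from(keys)
            .flatMap(key -> bucket.async().get(key))
            .toList()
            .toBlocking()
            .single();
    }
}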

Connect to a bucket that is not named 'default'

I have a Couchbase server and a .NET client. When I name the bucket "default" everything runs OK, but when I create a bucket with another name, like 'cashdb', my client gets a "Null Pointer Exception" error.
I really don't know what to do: if you want to have 3 buckets with different names on one server, what can you do?
When you have multiple buckets (or even a single bucket that's not named 'default'), you have to explicitly specify which one you want to open when creating the connection.
In the 1.x SDK it's:
var config = new CouchbaseClientConfiguration();
config.Bucket = "mybucket";
config.BucketPassword = "12345";
var connection = new CouchbaseClient(config);
In the 2.x SDK it's slightly longer, so take a look here: http://docs.couchbase.com/developer/dotnet-2.0/configuring-the-client.html
I cannot answer the part about the .NET driver, but I can address the multiple-bucket question.
You can have multiple buckets, but know why you are doing it. A logical organization is not necessarily a great reason, IMO. More buckets means more resources being used. I can give you a great example of when you would split data into separate buckets: views. If you have views that only look at a subset of your data and will never ever look at the other parts, then it might make sense to split it out. Say you have some JSON docs that are 30% of your data and a bunch of key-value pairs that are 70% of your data. More than likely, you would only ever create views on the JSON docs, and if there are enough of those docs at large enough sizes, splitting them out can provide much faster view creation, maintenance, cluster rebalances, etc.
Another good reason is if you have multiple applications accessing the same cluster.
Anyhow, it is fine to have multiple buckets; just read up on and understand the implications, and do it strategically.