Heterogeneous Couchbase Cluster

I would like to run a couchbase cluster on a hardware cluster that's not uniform. Some of the machines have 1 CPU core, while the others have 16 cores.
Is there a way to configure the bucket size or request frequency so the larger servers can receive a larger percentage of the load?
What I'm looking for is something similar to the weighting in ketama, but for Couchbase.

You stated in a previous answer that:
Usually 1 small instance and 4-20 large ones. The small one basically only exists for cluster discovery.
What you should really do is connect to your Couchbase cluster through a reverse proxy (such as HAProxy) instead of relying on a single always-up node. The reverse proxy will have all the potential nodes in its pool, constantly checking which nodes are really up and dispatching connections to those nodes. As soon as a node goes down, the connection will be re-established to a node that is alive.
You can read more about this architecture description on the Couchbase documentation.

No, there is no mechanism that does what you want. Couchbase runs each key through its own hash function to map it to a vBucket, and each vBucket is mapped to a server. The Couchbase API does not let you manage this mapping. All you can do is determine, from a document's id, which server owns that document.
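To illustrate that last point: with the 1.x Java SDK you can ask the client which node a given document id currently maps to, but you cannot influence that mapping. A rough sketch (hostname, bucket name and key are placeholders; getNodeLocator()/getPrimary() come from the underlying spymemcached client):

import java.net.URI;
import java.util.Arrays;
import java.util.List;
import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.MemcachedNode;

public class WhereIsMyKey {
    public static void main(String[] args) throws Exception {
        // Bootstrap URI, bucket name and key below are placeholders.
        List<URI> nodes = Arrays.asList(URI.create("http://node1.example.com:8091/pools"));
        CouchbaseClient client = new CouchbaseClient(nodes, "default", "");

        // The locator hashes the key to a vBucket and looks up the node that
        // currently owns that vBucket; you can read this mapping, not change it.
        MemcachedNode owner = client.getNodeLocator().getPrimary("user::1234");
        System.out.println("user::1234 lives on " + owner.getSocketAddress());

        client.shutdown();
    }
}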

Related

Where can I find the clear definitions for a Couchbase Cluster, Couchbase Node and a Couchbase Bucket?

I am new to Couchbase and NoSQL terminology. From my understanding, a Couchbase node is a single system running a Couchbase Server application, and a collection of such nodes sharing the same data by replication forms a Couchbase Cluster.
Also, a Couchbase Bucket is somewhat like a table in an RDBMS, where you put your documents. But how can I relate the Node to the Bucket? Can someone please explain it to me in simple terms?
A Node is a single machine (one IP/hostname) that runs Couchbase Server.
A Cluster is a group of Nodes that talk to each other. Data is distributed between the nodes automatically, so the load is balanced. The cluster can also provide replication of data for resilience.
A Bucket is the "logical" entity where your data is stored. It is both a namespace (like a database schema) and a table, to some extent. You can store multiple types of data in a single bucket; it doesn't care what form the data takes as long as it is a key and its associated value (so you can store users, apples and oranges in the same Bucket).
The Bucket gives the level of granularity for things like configuration (how much of the available memory do you want to dedicate to this bucket?), replication factor (how many backup copies of each document do you want on other nodes?), password protection...
Remember that I said a Bucket is a "logical" entity? Buckets are in fact divided into 1024 virtual fragments (vBuckets) which are spread between all the nodes of the cluster (that's how data distribution is achieved).
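To make the "users, apples and oranges" point concrete, here is a minimal sketch with the 1.x Java SDK; the node addresses and bucket name are placeholders, and the type prefix in the keys is only a naming convention, not something the bucket enforces:

import java.net.URI;
import java.util.Arrays;
import java.util.List;
import com.couchbase.client.CouchbaseClient;

public class BucketDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder bootstrap nodes and bucket name.
        List<URI> nodes = Arrays.asList(
            URI.create("http://node1.example.com:8091/pools"),
            URI.create("http://node2.example.com:8091/pools"));
        CouchbaseClient client = new CouchbaseClient(nodes, "mybucket", "");

        // One bucket, several kinds of documents; the bucket only sees keys and values.
        client.set("user::42", 0, "{\"name\":\"Alice\"}").get();
        client.set("apple::1", 0, "{\"variety\":\"fuji\"}").get();
        client.set("orange::1", 0, "{\"origin\":\"Valencia\"}").get();

        System.out.println(client.get("user::42"));
        client.shutdown();
    }
}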

Running couchbase on AWS under auto scaling

I want to run Couchbase on AWS EC2. Since my traffic is cyclic in nature, can I run Couchbase under auto-scaling? Since there are a lot of steps required to add/remove a node, I was wondering if this is the right approach. Has anybody tried it?
It has been done before. Here is a high level list of the things you'd have to do:
Define which Couchbase metrics you need to base your scaling decisions on.
Create a script to get those metrics from Couchbase and push them into CloudWatch, using the Couchbase REST API or CLI.
Create an AMI with Couchbase installed and the OS configured.
Script the addition of one or more new nodes (using the Couchbase REST API or CLI), plus a rebalance, as a response to auto-scaling expansion (see the sketch below).
Script the removal and rebalance of nodes (using the Couchbase REST API or CLI) as a response to contraction in auto-scaling.
Since you are relying on rebalances here, you will have to watch how long your rebalances take, and perhaps tune your cluster (e.g. move more vBuckets at once and other settings) and your usage of Couchbase for faster rebalances (e.g. if you have large views, they can affect rebalance time). Normally rebalances are meant to be a background process and take as long as they take, but that may not be appropriate in this particular use case. Only you can answer that.
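As a rough sketch of what the scale-out step could look like, here is an add-node-plus-rebalance call against the REST API; the hostnames, credentials and ns_1@... node names are placeholders, so check the REST API reference for your server version before relying on the exact parameters:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ScaleOut {
    // Minimal form-encoded POST with basic auth against the admin port (8091).
    static void post(String url, String body, String auth) throws Exception {
        HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
        c.setRequestMethod("POST");
        c.setDoOutput(true);
        c.setRequestProperty("Authorization", "Basic " +
            Base64.getEncoder().encodeToString(auth.getBytes(StandardCharsets.UTF_8)));
        c.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream os = c.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(url + " -> HTTP " + c.getResponseCode());
    }

    public static void main(String[] args) throws Exception {
        String admin = "Administrator:password";    // cluster credentials (placeholder)
        String existing = "http://10.0.0.10:8091";  // any node already in the cluster

        // 1. Tell the cluster about the freshly booted node.
        post(existing + "/controller/addNode",
             "hostname=10.0.0.20&user=Administrator&password=password", admin);

        // 2. Rebalance so the new node takes ownership of its share of vBuckets.
        String known = URLEncoder.encode("ns_1@10.0.0.10,ns_1@10.0.0.20", "UTF-8");
        post(existing + "/controller/rebalance",
             "knownNodes=" + known + "&ejectedNodes=", admin);
    }
}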

CouchbaseClient configuration for more than one cluster

Let's assume I have two Couchbase clusters with XDCR set up, having the following nodes:
n1.cluster1.com
n2.cluster1.com
n3.cluster1.com
and
n1.cluster2.com
n2.cluster2.com
n3.cluster2.com
What is the preferable node configuration for CouchbaseClient?
From http://docs.couchbase.com/couchbase-sdk-java-1.4/#hello-couchbase:
The CouchbaseClient class accepts a list of URIs that point to nodes in the cluster. If your cluster has more than one node, Couchbase strongly recommends that you add at least two or three URIs to the list. The list does not have to contain all nodes in the cluster, but you do need to provide a few nodes so that during the initial connection phase your client can connect to the cluster even if one or more nodes fail.
After the initial connection, the client automatically fetches cluster configuration and keeps it up-to-date, even when the cluster topology changes. This means that you do not need to change your application configuration at all when you add nodes to your cluster or when nodes fail. Also make sure you use a URI in this format: http://[YOUR-NODE]:8091/pools. If you provide only the IP address, your client will fail to connect. We call this initial URI the bootstrap URI.
Does it mean I should add at least two or three nodes from each cluster? Or two or three nodes from the whole system?
Each CouchbaseClient object will only connect to one cluster. The list of node URIs should all belong to the same cluster - you'll likely get strange behaviour if you list nodes from different clusters.
If your application wants to connect to two different clusters (irrespective of whether they have a replication stream between them or not), then you want to create two CouchbaseClient objects, one connected to each cluster.
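A minimal sketch with the 1.4 Java SDK, using the node names from the question (bucket name and password are placeholders):

import java.net.URI;
import java.util.Arrays;
import java.util.List;
import com.couchbase.client.CouchbaseClient;

public class TwoClusters {
    public static void main(String[] args) throws Exception {
        // One client per cluster; each bootstrap list contains only nodes from
        // that cluster, in the http://[YOUR-NODE]:8091/pools format.
        List<URI> cluster1 = Arrays.asList(
            URI.create("http://n1.cluster1.com:8091/pools"),
            URI.create("http://n2.cluster1.com:8091/pools"),
            URI.create("http://n3.cluster1.com:8091/pools"));
        List<URI> cluster2 = Arrays.asList(
            URI.create("http://n1.cluster2.com:8091/pools"),
            URI.create("http://n2.cluster2.com:8091/pools"),
            URI.create("http://n3.cluster2.com:8091/pools"));

        CouchbaseClient c1 = new CouchbaseClient(cluster1, "mybucket", "");
        CouchbaseClient c2 = new CouchbaseClient(cluster2, "mybucket", "");

        // Writes to cluster 1 reach cluster 2 only via the XDCR stream,
        // never through the client itself.
        c1.set("greeting", 0, "hello from cluster 1").get();

        c1.shutdown();
        c2.shutdown();
    }
}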
I recommend adding all nodes of the cluster to your client connect configuration. The reason is that if one or more of the nodes are down (i.e. planned shutdown, server crash, etc.), the client will still be able to connect to the cluster when it restarts.
Note that the client only needs this list of nodes at start-up; once it has communicated with the cluster, it maintains its own view of the active/inactive cluster nodes.
I have in production one cluster of 3 nodes and all my clients have all nodes in the connect configuration, e.g.
http://my-node1:8091/pools,http://my-node2:8091/pools,http://my-node3:8091/pools
Regarding multiple clusters, I'm not sure it will work with the same client instance, unless a Couchbase client instance is smart enough to distinguish multiple clusters and keep track of their nodes' health. Read the Couchbase installation guide.
I found in the documentation that if you are using Couchbase Moxi, it does support multiple clusters:
Moxi also supports proxying to multiple clusters from a single moxi
instance, where this was originally designed and implemented for
software-as-a-service purposes. Use a semicolon (';') to specify and
delimit more than one cluster:
-z "LISTEN_PORT=[CLUSTER_CONFIG][;LISTEN_PORT2=[CLUSTER_CONFIG2][]]"

Couchbase XDCR and haproxy

I intend to set up a Couchbase system with two clusters: the main cluster is active and the other is a backup (using XDCR to replicate). haproxy sits in front of this Couchbase system to switch (manually) from the active cluster to the backup cluster when the active cluster goes down.
Before testing, I want to ask for some advice on this topology. Are there any problems with it? Can it run smoothly in a production environment?
I think I cannot use a vBucket-aware client in this topology. Because the client only knows about haproxy, it cannot send requests directly to the Couchbase server that holds the vBucket for a specific document. Is that right?
From your scenario it sounds like unnecessary overhead. Why would you keep a "stand-by" cluster as a backup?
Instead, you can have all four instances of Couchbase Server as one cluster (each instance running on its own box), so you take full advantage of the vBucket architecture, which is managed natively. If one of the instances goes down, you will have no loss of data, since the enabled replication keeps mirror copies on the other nodes.
We use this setup in production with no issues. From time to time we bring one of the instances down for maintenance and the rest of the cluster keeps running; it's completely transparent to the Couchbase clients, i.e. no downtime!
In my opinion XDCR makes sense for geographically separated locations (so you keep one cluster in the Americas, another in EMEA, and so on). If all your instances are in the same location, then Couchbase cluster technology already delivers high availability (HA) with fail-over support built in.
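If you go with the single-cluster approach, the built-in fail-over mentioned above can also be made automatic. A rough sketch using the REST API's auto-failover settings (host, credentials and the 30-second timeout are placeholders, and the exact parameters may differ between server versions):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EnableAutoFailover {
    public static void main(String[] args) throws Exception {
        // POST to the auto-failover settings endpoint on any cluster node.
        URL url = new URL("http://10.0.0.10:8091/settings/autoFailover");
        HttpURLConnection c = (HttpURLConnection) url.openConnection();
        c.setRequestMethod("POST");
        c.setDoOutput(true);
        c.setRequestProperty("Authorization", "Basic " + Base64.getEncoder()
            .encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8)));
        c.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream os = c.getOutputStream()) {
            // Fail a node over automatically after it has been down for 30 seconds.
            os.write("enabled=true&timeout=30".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("autoFailover -> HTTP " + c.getResponseCode());
    }
}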

Does the MySQL NDB Cluster consider node distance? Will it use the replicas if they are nearer?

I'm building a very small NDB cluster with only 3 machines. This means that machine 1 will serve as the MGM server, a MySQL server, and an NDB data node. The database is only 7 GB, so I plan to replicate each node at least once. Now, a query might end up using data that is cached in the NDB node on machine 1, even if that node isn't the primary source for that data, in which case access would be much faster (for obvious reasons).
Does the NDB cluster work like that? Every example I see has at least 5 machines. The manual doesn't seem to mention how to handle node differences like this one.
There are a couple of questions here:
Availability / NoOfReplicas
MySQL Cluster can give high availability when data is replicated across 2 or more data node processes. This requires that the NoOfReplicas configuration parameter is set to 2 or greater. With NoOfReplicas=1, each row is stored in only one data node, and a data node failure would mean that some data is unavailable and therefore the database as a whole is unavailable.
Number of machines / hosts
For HA configurations with NoOfReplicas=2, there should be at least 3 separate hosts. One is needed for each of the data node processes, each of which has a copy of all of the data. A third is needed to act as an 'arbitrator' when communication between the 2 data node processes fails. This ensures that only one of the data nodes continues to accept write transactions, and avoids data divergence (split brain). With only two hosts, the cluster is resilient to the failure of just one particular host; if the other host fails instead, the whole cluster fails. The arbitration role is very lightweight, so this third machine can be used for almost any other task as well.
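As a sketch, a minimal config.ini for one way of laying out the three hosts along these lines (hostnames and memory sizes are placeholders):

[ndbd default]
NoOfReplicas=2
# sized with headroom for the ~7 GB dataset; adjust to your machines
DataMemory=8192M

[ndb_mgmd]
# management node on the third host; it also acts as the arbitrator
HostName=host3.example.com

[ndbd]
# data node 1
HostName=host1.example.com

[ndbd]
# data node 2
HostName=host2.example.com

[mysqld]
# SQL node co-located with data node 1
HostName=host1.example.com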
Data locality
In a 2 node configuration with NoOfReplicas=2, each data node process stores all of the data. However, this does not mean that only one data node process is used to read/write data. Both processes are involved with writes (as they must maintain copies), and generally, either process could be involved in a read.
Some work to improve read locality in a 2-node configuration is under consideration, but nothing is concrete.
This means that when MySQLD (or another NdbApi client) is colocated with one of the two data nodes, there will still be quite a lot of communication with the other data node.