Difference between node unresponsive and node failover in Couchbase

What is the difference between an unresponsive node and node failover in Couchbase? While monitoring my cluster I can see one node in an unresponsive state. What is the exact difference between the two?

An "unresponsive" node is one that is not responding to requests. A node could be unresponsive because of a network or hardware problem, or an internal server error.
Failover is what you do to an unresponsive node to forcibly remove it from the cluster.
Further reading:
Remove a node and rebalance (a graceful way to remove a node that IS responsive)
Fail a Node Over and Rebalance (a less graceful way that works with unresponsive nodes and can potentially result in loss of data that hasn't yet been replicated from the failed node)
Recovery (after you fail over a node, if it starts behaving again you can add it back into the cluster)
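Failover can be triggered from the web console, from couchbase-cli, or over the REST API. As a hedged sketch of the REST route (the /controller/failOver endpoint on port 8091 and its otpNode parameter are my reading of the Couchbase REST documentation; verify against your server version, and the host names and credentials below are placeholders):

```python
import base64
import urllib.parse
import urllib.request

def build_failover_request(cluster_host, otp_node, user, password):
    """Build a POST request that hard-fails-over `otp_node`.

    Assumes the Couchbase REST endpoint /controller/failOver on port
    8091 (check your version's REST reference before relying on it).
    """
    url = f"http://{cluster_host}:8091/controller/failOver"
    data = urllib.parse.urlencode({"otpNode": otp_node}).encode()
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=data,  # POST body, e.g. otpNode=ns_1%4010.0.0.7
        headers={"Authorization": f"Basic {auth}"},
        method="POST",
    )

# Example with placeholder values (nothing is sent here):
req = build_failover_request("10.0.0.5", "ns_1@10.0.0.7",
                             "Administrator", "secret")
# urllib.request.urlopen(req) would perform the failover.
```

Graceful failover uses a different endpoint (/controller/startGracefulFailover in the versions I have seen), so only use the hard variant when the node really is unresponsive.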

Related

Can I run user pods on Openshift master or Infra nodes

I am new to OpenShift, so I am a bit confused about whether we can run user pods on master or infra nodes. We have two worker nodes plus one master and one infra node, making four nodes in total. The reason for the change is to share the load across all four nodes rather than just the two compute nodes.
From reading some documents it seems possible to assign two roles to one node, but is there any security risk, or is it simply not best practice?
We are running OpenShift version v3.11.0+d699176-406.
if we can run user pods on master or infra nodes
Yes, you absolutely can. The easiest way is to configure this at installation time; see https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-using-native-ha-ai for an example.
is there any security risk or is it not best practice
Running a single master node or a single infra node is already a risk to the high availability of your cluster. If the master fails, your cluster is essentially headless; if the infra node fails, you lose your internal registry and routers, and with them external access and the ability to build new images for your image streams. The same applies to host OS upgrades: you will have to reboot the master and infra nodes some day. Are you okay with guaranteed downtime during patching? What if something goes wrong during the update?
Regarding running user workloads on master and infra nodes: if you are not running privileged SCCs (which can allow privileged pods, arbitrary UIDs on the host, and so on), you are somewhat safe from a container breach, assuming there are no known bugs in the container engine you are using. However, you should pay close attention to resource consumption and avoid running any workloads without CPU and memory limits, because overloading a master node can degrade the whole cluster. You should also monitor disk usage, since running user pods loads more images into your Docker storage.
So basically it boils down to this:
It is better to have multiple masters (ideally three) and a couple of infra nodes than a single point of failure for both of these roles. Having separate master and worker nodes is of course better than hosting them together, but a multi-master setup, even with user workloads, should be more resilient as long as you watch resource usage carefully.
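For illustration, in OpenShift 3.11 role co-location is expressed through node groups in the install inventory. A minimal sketch, assuming the standard node group names from the 3.11 example inventories (hostnames are placeholders):

```ini
# [nodes] section of an OSEv3 Ansible inventory (hostnames are placeholders)
[nodes]
# master also runs infra components (registry, routers)
master1.example.com openshift_node_group_name='node-config-master-infra'
# plain compute nodes for user workloads
node1.example.com   openshift_node_group_name='node-config-compute'
node2.example.com   openshift_node_group_name='node-config-compute'
```

If you also want user pods scheduled on the master, the 3.11 docs list an all-in-one node group for that purpose; either way, keep resource limits on user workloads as described above.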

Couchbase compaction and data loss

We had an issue with our cluster recently (running on 4.5 currently)… Compaction was not completing, it seemed to be continually running on a single node and the disk was getting close to full. We canceled the compaction and had to bounce the node to get things stable. We are using a 4 node cluster with a replication factor of 2.
What we are seeing now is some documents missing, and some documents reverted to an earlier version.
Is it possible that the scenario described above could cause documents to be missing or reverted to a previous version?

How to recover the cluster when all nodes are down?

If all nodes in a 3-node Percona Cluster have shut down (gracefully or by crashing), then according to this blog the cluster can recover automatically once the nodes can reach each other. However, starting the nodes in such a situation seems to be a difficult task.
So is there a reliable and operable method to do cluster recovery in this situation?
Examine the grastate.dat file on all 3 nodes. Which node has the highest sequence number? You should bootstrap that node. Wait for it to come online. Then start node2; it should perform an IST (incremental state transfer) from the bootstrap node. Then start node3.
Golden rule: You must always bootstrap the very first node of any cluster. Bootstrapping does not erase data; it only starts a new cluster.
Depending on the version, you may need to set safe_to_bootstrap in the grastate file to 1 manually.
Another way to check which node is the most advanced: run the command below on every node and see which one reports the largest committed transaction value.
mysqld_safe --wsrep-recover
Then start the node with the highest committed value first, followed by the second and third.
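To illustrate the selection step, here is a small sketch (not part of Percona's tooling) that parses the seqno line out of each node's grastate.dat and picks the bootstrap candidate. A seqno of -1 means the node shut down uncleanly, so you would need mysqld_safe --wsrep-recover to learn its real position before trusting the result:

```python
def parse_seqno(grastate_text):
    """Extract the seqno value from grastate.dat contents."""
    for line in grastate_text.splitlines():
        if line.strip().startswith("seqno:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no seqno line found")

def pick_bootstrap_node(states):
    """states maps node name -> grastate.dat contents.

    Returns (node with the highest seqno, all seqnos). Nodes at -1
    shut down uncleanly; recover their position with
    mysqld_safe --wsrep-recover before trusting this choice.
    """
    seqnos = {node: parse_seqno(text) for node, text in states.items()}
    return max(seqnos, key=seqnos.get), seqnos

# A grastate.dat template with a placeholder uuid:
grastate = """# GALERA saved state
version: 2.1
uuid:    6c42a9a2-0000-0000-0000-000000000000
seqno:   {n}
safe_to_bootstrap: 0
"""

node, seqnos = pick_bootstrap_node({
    "node1": grastate.format(n=1532),
    "node2": grastate.format(n=1540),
    "node3": grastate.format(n=-1),  # unclean shutdown
})
print(node)  # node2 has the highest seqno, so bootstrap it first
```

Depending on the version, remember you may also have to flip safe_to_bootstrap to 1 on the chosen node, as noted above.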

Amazon Aurora DB Cluster Not Auto Balancing Correctly

I have created an Amazon Aurora database cluster running MySQL with three instances: the main instance that backs the cluster and two read replicas for balancing. However, the cluster does not seem to be balancing reads at all. One replica is handling 700+ selects/sec, maxing its CPU at 99.75% or higher, while the other replica is doing virtually nothing, sitting at 4% CPU with one select per second, if that. The main cluster instance itself is at 33% CPU usage, as it is being written to while the replicas are being read from. The replication lag between the replicas is under 20 milliseconds. My application queries the read-only endpoint of the cluster, but no balancing appears to be happening. Does anyone have any insight into why this may be happening, or why one replica is at such high CPU usage? The queries being run against it are not complex by any means.
Aurora cluster endpoints are DNS records, and they only do DNS round robin during resolution. This means that when your client application opens connections to a cluster endpoint, successive resolutions return different instances (different IPs, basically), thereby striping your connections across multiple replicas. Past that point, there is no load balancing. Connections are striped across instances, and queries run on each of those connections go to the corresponding instance backing it.
Now consider the scenario where your connection pool was created against the cluster endpoint while there was only one instance behind it. If you then add more instances, there will be no impact on your application unless you terminate your connections and re-establish them. Only then would you do a DNS round robin again, and this time some of your connections would land on the new instance that you provisioned.
A few callouts:
In Aurora, you have 2 cluster endpoints. One (RW) endpoint always points to the current writer and one (RO) does the DNS round robin between your read replicas.
Also, DNS propagation can take a few seconds when failovers happen, so occasional errors at failover time are quite natural.
Hope this helps.
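The connection-striping behaviour above can be shown with a toy simulation (entirely illustrative; the IPs and class names are made up, and real Aurora adds DNS caching and TTLs on top of this):

```python
import itertools

class RoundRobinDNS:
    """Toy stand-in for the reader endpoint's DNS round robin.

    Each resolution returns the next replica IP, the way the real
    endpoint does when nothing caches the record.
    """
    def __init__(self, ips):
        self._ips = list(ips)
        self._cycle = itertools.cycle(self._ips)

    def resolve(self):
        return next(self._cycle)

    def add_replica(self, ip):
        self._ips.append(ip)
        self._cycle = itertools.cycle(self._ips)

class Connection:
    """A connection is pinned to the IP resolved at creation time."""
    def __init__(self, ip):
        self.ip = ip

    def query(self, sql):
        return self.ip  # every query on this connection hits this instance

dns = RoundRobinDNS(["10.0.1.11"])          # one replica at pool creation
pool = [Connection(dns.resolve()) for _ in range(4)]

dns.add_replica("10.0.1.12")                 # scale out: add a replica

# Existing connections never move: all queries still hit the old replica.
hit = {c.query("SELECT 1") for c in pool}    # only 10.0.1.11

# Only re-resolving (i.e. opening new connections) stripes the load.
new_pool = [Connection(dns.resolve()) for _ in range(4)]
spread = {c.query("SELECT 1") for c in new_pool}  # both replicas
```

This is why a long-lived pool built when only one replica was hot keeps hammering that replica, which matches the 99.75% vs 4% CPU split in the question.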
We've implemented a driver to try to mitigate this problem, with some visible gains: https://github.com/DiceTechnology/dice-fairlink
It regularly discovers the read-replicas to catch up with cluster changes and round-robins connections among them.
Although we have not measured CPU utilisation formally, we've observed a better load distribution than with the native DNS-based round robin of the cluster reader endpoint.
Aurora's DNS-based load balancing works at the connection level (not the individual query level). You must keep resolving the endpoint without caching DNS to get a different instance IP on each resolution. If you only resolve the endpoint once and then keep the connection in your pool, every query on that connection goes to the same instance. If you cache DNS, you receive the same instance IP each time you resolve the endpoint.
Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don't further increase the DNS cache TTL. Remember that DNS caching can occur anywhere from your network layer, through the operating system, to the application container. For example, Java virtual machines (JVMs) are notorious for caching DNS indefinitely unless configured otherwise. See the AWS documentation and the Aurora whitepaper on configuring the DNS cache TTL.
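For JVM clients specifically, the indefinite caching mentioned above is governed by the networkaddress.cache.ttl security property. A sketch (the 5-second value mirrors Aurora's zone TTL; adjust for your environment, and the file path varies by JDK version):

```properties
# In the JDK's java.security file, e.g. $JAVA_HOME/conf/security/java.security
# (older JVMs can use the system property -Dsun.net.inetaddr.ttl=5 instead)
networkaddress.cache.ttl=5
```

With a security manager enabled, some JVMs default to caching successful lookups forever, so leaving this unset can pin every connection to one replica no matter what the DNS record does.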
My guess is that you are not connecting to the cluster endpoint.
Load Balancing – Connecting to the cluster endpoint allows Aurora to load-balance connections across the replicas in the DB cluster. This helps to spread the read workload around and can lead to better performance and more equitable use of the resources available to each replica. In the event of a failover, if the replica that you are connected to is promoted to the primary instance, the connection will be dropped. You can then reconnect to the reader endpoint in order to send your read queries to the other replicas in the cluster.
New Reader Endpoint for Amazon Aurora – Load Balancing & Higher Availability
[EDIT]
To load balance within a single application, you will need to reconnect to the endpoint. If you use the same connection for all queries, only one replica will respond. However, opening connections is expensive, so this may not provide much benefit unless your queries take some time to run.

What happens when the one and only management node goes down in NDB Cluster?

I have set up the MySQL NDB Cluster (mysql-cluster-gpl-7.3.5-linux-glibc2.5-x86_64) with 5 Nodes as described below:
Node A: multithreaded data node1, SQL node1
Node B: multithreaded data node2, SQL node2
Node C: management node1
So I have kept one and only one management node, which manages the other nodes.
While transactions were running, I suddenly killed the management node process, yet the other nodes kept running. Even the response times from both DBs (the SQL nodes) did not fluctuate.
Can you explain what happens at this moment?
Do the SQL nodes stay in sync in this scenario?
OR
Do they need the management node to keep them in sync?
Thanks in advance.
The management node acts as the arbitrator for the data nodes. If the management node goes down and both data nodes can still see each other, the cluster has quorum and operates normally, syncing data. The management node's role is simply to hold the cluster configuration and act as arbitrator; it is not involved in data synchronization.
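That said, with a single management node you lose arbitration if a network partition happens while it is down. A second management node removes that single point of failure. A minimal config.ini sketch (hostnames are placeholders; the section names follow the standard NDB Cluster configuration format):

```ini
[ndb_mgmd]
NodeId=1
HostName=mgm1.example.com

[ndb_mgmd]
NodeId=2
HostName=mgm2.example.com

[ndbd]
HostName=nodeA.example.com

[ndbd]
HostName=nodeB.example.com

[mysqld]
[mysqld]
```

Note that the management nodes should run on separate hosts from the data nodes, otherwise losing one host can take out both a data node and the arbitrator at once.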