How does Zookeeper manage node roles in other clusters? - configuration

My understanding is that Zookeeper is often used to solve the problem of "keeping track of which node plays a particular role" in a distributed system (e.g. master node in a DB or in a MapReduce cluster, etc).
For simplicity, say we have a DB with one master and multiple replicas and the current master node in the DB goes down. In this scenario, one would, in principle, make one of the replica nodes a new master node. At this point my understanding is:
If we didn't have Zookeeper
The application servers may not know that we have a new master node, so they would not know where to send writes unless we have some custom logic on the app server itself to detect / correct this problem.
If we have Zookeeper
Zookeeper would somehow detect this failure and update the value for the corresponding master key. Moreover, application servers can (optionally?) register hooks in Zookeeper, so Zookeeper can notify them of this failure and the app servers can update (e.g. in memory) which DB node is the new master.
My questions are:
How does Zookeeper know what node to make master? Is Zookeeper responsible for this choice?
How is this information propagated to nodes that need to interact with Zookeeper? E.g. if one of the Zookeeper nodes goes down, how would the application servers know which Zookeeper node to hit in this scenario? Does Zookeeper manage this differently from competing solutions like e.g. etcd?

The answer to both 1. and 2. is the leader election process, which briefly works in the following way:
When a process starts in a cluster managed by ZK, the cluster enters an election state. If there is a leader, then there is an established hierarchy and the existing leader is just verified. If there is no leader (say the master is down), ZK forces the znodes to use sequence flags to look for a new leader. Each node talks to its peers and sends a message containing the node's identifier (sid) and the most recent transaction it executed (zxid). These messages are called votes. When a node receives a vote it can either keep it or discard it, depending on the zxid: if the zxid is newer than what it has, it keeps the vote; if it is older, it discards it. If there is a tie in zxids, the vote with the highest sid wins. So there will come a time when all nodes hold the same vote, which defines the new leader by its sid. This is how ZK elects a new leader node!
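On the application-server side of the question (how app servers learn which DB node is now the master), here is a minimal, hedged sketch using the ZooKeeper Java client. It assumes a hypothetical znode /db/master whose data holds the current master's address; the path and class names are illustrative, not an official recipe.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch: an app server watches a hypothetical /db/master znode so it learns
// (asynchronously) which DB node is currently the master.
public class MasterWatcher implements Watcher {
    private static final String MASTER_PATH = "/db/master"; // assumed path
    private final ZooKeeper zk;
    private volatile String currentMaster;

    public MasterWatcher(String zkConnectString) throws Exception {
        // zkConnectString lists several ZK servers, e.g. "zk1:2181,zk2:2181,zk3:2181",
        // so the client can still bootstrap if one ZooKeeper node is down.
        this.zk = new ZooKeeper(zkConnectString, 10_000, this);
        refreshMaster();
    }

    private void refreshMaster() throws Exception {
        // Re-register the watch on every read; ZooKeeper watches are one-shot.
        byte[] data = zk.getData(MASTER_PATH, this, null);
        currentMaster = new String(data);
    }

    @Override
    public void process(WatchedEvent event) {
        // Fired when /db/master changes (e.g. after a failover elects a new master).
        if (MASTER_PATH.equals(event.getPath())) {
            try {
                refreshMaster(); // update the in-memory view of the master
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public String getCurrentMaster() {
        return currentMaster;
    }
}

Note that the connect string answers the second question in practice: the client is given several ZooKeeper servers and fails over between them on its own.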

Related

Database Architecture: Central Database to Work as a Registry for Local Databases

The central database (blue) will hold all customer data of the project.
The local databases (green) will be deployed at the physical locations containing a copy of the customer databases. Multiple stores can be deployed across geographical areas (A, B, ..., N) to allow customers to register and make purchases.
When a customer is registered at a local store, it should be updated in the central database with the purchase history. When a customer is registered, his purchase history should also be available in other stores.
For example, in the morning, a customer can purchase from store A, and afterward, customers should be able to purchase from store B/C or any other without registering again.
MySQL will be used as the database.
Advice is expected:
Is there a database architecture or pattern with which we can achieve this?
What's the best approach to implement this?
Referred: Database Architecture, Central and/vs Localized Server
There are three popular replication algorithms:
single-leader: there is just one leader node
multi-leader: there are many leader nodes
leaderless: there is no leader node
Read more about these algorithms in "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann. As a quick overview, you can read this article "Database replication — an overview".
When a customer is registered at a local store, it should be updated in the central database with the purchase history. When a customer is registered, his purchase history should also be available in other stores.
It looks like you need to use master-master replication, also called multi-master or multi-leader replication. As Wikipedia says:
Multi-master replication is a method of database replication which
allows data to be stored by a group of computers, and updated by any
member of the group. All members are responsive to client data
queries. The multi-master replication system is responsible for
propagating the data modifications made by each member to the rest of
the group and resolving any conflicts that might arise between
concurrent changes made by different members.
And MySQL supports this:
MySQL Group Replication is a MySQL Server plugin that enables you to
create elastic, highly-available, fault-tolerant replication
topologies.
Groups can operate in a single-primary mode with automatic primary
election, where only one server accepts updates at a time.
Alternatively, for more advanced users, groups can be deployed in multi-primary mode, where all servers can accept updates, even if they
are issued concurrently
I highly recommend reading the chapter "Replication" of the book "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann.
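As a small illustration of the group setup described in the quoted documentation, here is a hedged JDBC sketch that lists the members of a Group Replication group and their roles. It assumes MySQL 8.0 with Group Replication already configured; the host and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: list the members of a MySQL Group Replication group and their roles
// by querying performance_schema.replication_group_members (MySQL 8.0).
public class GroupMembers {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://central-db.example.com:3306/"; // placeholder host
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT member_host, member_state, member_role " +
                 "FROM performance_schema.replication_group_members")) {
            while (rs.next()) {
                System.out.printf("%s  state=%s  role=%s%n",
                    rs.getString("member_host"),
                    rs.getString("member_state"),
                    rs.getString("member_role"));
            }
        }
    }
}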
Short Answer: Nothing standard within MySQL.
Long Answer: It is a tough problem because of network outages, temporary server outages, etc.
Partial solutions:
The "right" answer is to have every "customer" not have its own database, but instead, do all reads and writes on the "Main computer".
To have only local data on each "customer" db (which would be a Primary), the Main could be a Replica receiving updates from each customer. But this says that the only complete copy is on Main.
To have each customer have all the data, you must write to main (Primary) and read locally (Replica).
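For that last option (write to the Main/Primary, read locally from the Replica), a simple application-side pattern is to hold two connections and route by operation type. A rough sketch, with hostnames and credentials purely hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;

// Rough sketch of application-side routing: writes go to the central primary,
// reads go to the local replica. Hostnames and credentials are hypothetical.
public class RoutedDb {
    private final Connection primary;  // central "Main" MySQL, accepts writes
    private final Connection replica;  // local read-only copy at the store

    public RoutedDb() throws Exception {
        primary = DriverManager.getConnection(
            "jdbc:mysql://main.example.com:3306/shop", "user", "password");
        replica = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/shop", "user", "password");
    }

    public Connection forWrite() { return primary; }  // INSERT/UPDATE/DELETE
    public Connection forRead()  { return replica; }  // SELECT (may lag slightly)
}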

Use of JournalNode in high availability

What is the need for a JournalNode?
Why do we configure three JournalNodes in high availability?
Is it only for replication?
The role of JournalNodes is to keep both NameNodes in sync and avoid an HDFS split-brain scenario by allowing only the active NameNode to write to the journals.
From the Apache Hadoop documentation:
Prior to Hadoop 2.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine was unavailable, the cluster on the whole would be unavailable until the NameNode was either restarted or started on a separate machine. In a classic HA cluster, two separate machines are configured as NameNodes. At any point, one of the NameNodes will be in Active state and the other will be in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover.
In order for the Standby node to keep its state coordinated with the Active node, both nodes communicate with a group of separate daemons called ‘JournalNodes’ (JNs). When any namespace modification is performed by the Active node, it logs a record of the changes made, in the JournalNodes. The Standby node is capable of reading the amended information from the JNs, and is regularly monitoring them for changes. As the Standby Node sees the changes, it then applies them to its own namespace. In case of a failover, the Standby will make sure that it has read all the changes from the JournalNodes before changing its state to ‘Active state’. This guarantees that the namespace state is fully synched before a failover occurs.
JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally.
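To make the quorum arithmetic from that note concrete, here is a tiny sketch of the majority and failure-tolerance formula; this is not Hadoop code, just the (N - 1) / 2 rule quoted from the documentation.

// The edit-log write must reach a majority of JournalNodes, so with N JNs the
// quorum size is N/2 + 1 and the cluster tolerates (N - 1) / 2 JN failures.
public class JournalNodeQuorum {
    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            int quorum = n / 2 + 1;
            int tolerated = (n - 1) / 2;
            System.out.printf("JNs=%d  quorum=%d  tolerated failures=%d%n",
                n, quorum, tolerated);
        }
    }
}

So 3 JournalNodes tolerate 1 failure, 5 tolerate 2, and 7 tolerate 3, which is why an odd count is recommended.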
Here are also some good external links about JournalNodes:
https://www.edureka.co/blog/namenode-high-availability-with-quorum-journal-manager-qjm/
https://community.hortonworks.com/articles/27225/how-qjm-works-in-namenode-ha.html

What is the Master node in Couchbase?

In Couchbase documentation: https://developer.couchbase.com/documentation/server/current/concepts/distributed-data-management.html
There is no concept of master nodes, slave nodes, config nodes, name nodes, head nodes, etc, and all the software loaded on each node is identical
But in my logs I get the message found in post:
https://forums.couchbase.com/t/havent-heard-from-a-higher-priority-node-or-a-master-so-im-taking-over/5924
Haven't heard from a higher priority node or a master, so I'm taking over. mb_master 000 ns_1#10.200.0.10 1:07:38 AM Tue Feb 7, 2017
and
Somebody thinks we're master. Not forcing mastership takover over ourselves mb_master 000 ns_1#10.200.0.10 1:07:28 AM Tue Feb 7, 2017
I am having trouble finding what the master does, because any search about a master results in the comment of couchbase not having a master node.
The error messages seem to originate from the cluster management code, which should look like this (I didn't manage to find the Couchbase implementation of it; the link points to the implementation of membase, which is the predecessor of Couchbase).
While all nodes are equal in Couchbase, this is not the case when there is some redistribution of data. As described in detail in this document, a master is chosen to manage the redistribution. The log messages you get are caused by this process.
The Master Node in the cluster manager is also known as the orchestrator.
Straight from the Couchbase Server 4.6 documentation, https://developer.couchbase.com/documentation/server/4.6/concepts/distributed-data-management.html
Although each node runs its own local Cluster Manager, there is only
one node chosen from among them, called the orchestrator, that
supervises the cluster at a given point in time. The orchestrator
maintains the authoritative copy of the cluster configuration, and
performs the necessary node management functions to avoid any
conflicts from multiple nodes interacting. If a node becomes
unresponsive for any reason, the orchestrator notifies the other nodes
in the cluster and promotes the relevant replicas to active status.
This process is called failover, and it can be done automatically or
manually. If the orchestrator fails or loses communication with the
cluster for any reason, the remaining nodes detect the failure when
they stop receiving its heartbeat, so they immediately elect a new
orchestrator. This is done immediately and is transparent to the
operations of the cluster.

CouchbaseClient configuration for more than one cluster

Let's assume I have two Couchbase clusters with XDCR set up, having the following nodes:
n1.cluster1.com
n2.cluster1.com
n3.cluster1.com
and
n1.cluster2.com
n2.cluster2.com
n3.cluster2.com
What is the preferable node configuration for CouchbaseClient?
From http://docs.couchbase.com/couchbase-sdk-java-1.4/#hello-couchbase:
The CouchbaseClient class accepts a list of URIs that point to nodes in the cluster. If your cluster has more than one node, Couchbase strongly recommends that you add at least two or three URIs to the list. The list does not have to contain all nodes in the cluster, but you do need to provide a few nodes so that during the initial connection phase your client can connect to the cluster even if one or more nodes fail.
After the initial connection, the client automatically fetches cluster configuration and keeps it up-to-date, even when the cluster topology changes. This means that you do not need to change your application configuration at all when you add nodes to your cluster or when nodes fail. Also make sure you use a URI in this format: http://[YOUR-NODE]:8091/pools. If you provide only the IP address, your client will fail to connect. We call this initial URI the bootstrap URI.
Does it mean I should add at least two or three nodes from each cluster? Or two or three nodes from the whole system?
Each CouchbaseClient object will only connect to one cluster. The list of node URIs should all belong to the same cluster - you'll likely get strange behaviour if you list nodes from different clusters.
If your application wants to connect to two different clusters (irrespective of whether they have a replication stream between them or not), then you want to create two CouchbaseClient objects, one connected to each cluster.
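With the legacy Java SDK 1.4 that the question links to, that would look roughly like the sketch below; the bucket name and (empty) password are placeholders, and each client is given bootstrap URIs from one cluster only.

import java.net.URI;
import java.util.Arrays;
import java.util.List;
import com.couchbase.client.CouchbaseClient;

// Sketch for the legacy Java SDK 1.4: one CouchbaseClient per cluster.
// Bucket name and password are placeholders.
public class TwoClusters {
    public static void main(String[] args) throws Exception {
        List<URI> cluster1 = Arrays.asList(
            URI.create("http://n1.cluster1.com:8091/pools"),
            URI.create("http://n2.cluster1.com:8091/pools"),
            URI.create("http://n3.cluster1.com:8091/pools"));
        List<URI> cluster2 = Arrays.asList(
            URI.create("http://n1.cluster2.com:8091/pools"),
            URI.create("http://n2.cluster2.com:8091/pools"),
            URI.create("http://n3.cluster2.com:8091/pools"));

        // Never mix URIs from both clusters in one bootstrap list.
        CouchbaseClient client1 = new CouchbaseClient(cluster1, "bucket", "");
        CouchbaseClient client2 = new CouchbaseClient(cluster2, "bucket", "");

        // ... use client1 for cluster1 data and client2 for cluster2 data ...

        client1.shutdown();
        client2.shutdown();
    }
}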
I recommend adding all nodes of the cluster to your client connect configuration. The reason is that if one or more of the nodes are down (i.e. planned shutdown, server crash, etc.), the client will still be able to connect to the cluster when it restarts.
Note that the client needs this list of connect nodes only at startup; once it has communicated with the cluster, it maintains its own view of active/inactive cluster nodes.
I have in production one cluster of 3 nodes and all my clients have all nodes in the connect configuration, e.g.
http://my-node1:8091/pools,http://my-node2:8091/pools,http://my-node3:8091/pools
Regarding multiple clusters, I'm not sure it will work with the same client instance, unless a Couchbase client instance is smart enough to distinguish multiple clusters and keep track of their nodes' health. Read the Couchbase installation guide.
I found in the documentation that if you are using Couchbase Moxi, it does support multiple clusters:
Moxi also supports proxying to multiple clusters from a single moxi
instance, where this was originally designed and implemented for
software-as-a-service purposes. Use a semicolon (';') to specify and
delimit more than one cluster:
-z "LISTEN_PORT=[CLUSTER_CONFIG][;LISTEN_PORT2=[CLUSTER_CONFIG2][]]"

Can a webserver determine if it's the active node of an HA failover system without hard-coding anything on the server itself?

I can think of a few hacks using ping, the box name, and the HA shared name but I think that they are leading to data leakage.
Should a box even know its part of an HA cluster or what that cluster name is? Is this more a function of DNS? Is there some API exposed for boxes to join an HA cluster and request the id of the currently active node?
I want to differentiate between the inactive node and active node in alerting mechanisms for a running program. If the active node is alerting I want to hit a pager and on the inactive node I want to send an email. Pushing the determination into the alerting layer moves the same problem elsewhere.
EASY SOLUTION: Polling the server from an external agent that connects through the network makes any shell game of who is the active node a moot point. To clarify: the only thing that will page is the remote agent monitoring the real node. Each box can send emails all day long for all I care.
It really depends on the HA system you're using.
For example, if your system uses a shared IP and the traffic is managed by some hardware box, then it can be hard to determine if a certain box is a master or slave. That will really depend on the specific solution... As long as you can add a custom script to the supervisor, you should be OK - for example, the controller can ping a daemon on the master server every second. In the alerting script, simply check if the time since the last ping is < 2 sec...
If your system doesn't have a supervisor / controller node, but each node tries to determine the state itself, you can have more problems. If a split brain occurs, you can end up with both slaves or both masters, so your alerting software will be wrong in both cases. Gadgets that can ensure only one live node (STONITH and others) could help.
On the other hand, in the second scenario, if the HA software works on both hosts properly, you should be able to obtain the master/slave information straight from it. It has to know its own state at any time, because it's one of its main functions. In most HA solutions you should be able to either get the current state, or add some code to run when the state changes. Heartbeat offers both.
I wouldn't worry about the edge cases like a split brain though. Almost any situation when you lose connection between the clustered nodes will be more important than the stuff that happens on the separate nodes :)
If the thing you care about is really logging / alerting only, then ideally you could have a separate logger box which gets all the information about the current network / cluster status. An external box will probably have a better idea of how to deal with the situation. If your cluster gets DoS'ed / disconnected from the network / loses power, you won't get any alert. A redundant pair of independent monitors can save you from that.
I'm not sure why you mentioned DNS - due to its refresh time it shouldn't be a source of any "real-time" cluster information.
One way is to get the box to export its idea of whether it is active into your monitoring. From there you can predicate paging/emailing on this status (with a race condition around failover), and alert on none/too many systems believing they are active.
Another option is to monitor the active system via a DNS alias (or some other method to address the active system) and page on that. Then also monitor all the systems, both active and inactive, and email on that. This will cause duplicate alerts for the active system, but that's probably okay.
It's hard to be more specific without knowing more about your setup.
As a rule, the machines in an HA cluster shouldn't really know which one is active. There's one exception, mind, and that's with cronjobs. At work, we have an HA cluster on top of which some rather important services run. Some of those services have cronjobs, and we only want them running on the active box. To do that, we use this shell script:
#!/bin/sh
# Shared/virtual IP of the HA cluster; replace with your own cluster IP
HA_CLUSTER_IP=0.0.0.0
# Run the given command only if this box currently holds the cluster IP
if ip addr | grep "$HA_CLUSTER_IP" >/dev/null; then
    eval "$@"
fi
(Note that this is running on Debian.) What this does is check to see if the current box is the active one within the cluster (replace 0.0.0.0 with the external IP of your HA cluster), and if so, executes the command passed in as arguments to the script. This ensures that one and only one box is ever actually executing the cronjobs.
Other than that, there's really no reasons I can think of why you'd need to know which box is the active one.
UPDATE: Our HA cluster uses Heartbeat to assign the cluster's external IP address as a secondary address to the active machine in the cluster. Programmatically, you can check to see if your machine is the current active box by calling gethostbyname(), and iterating over the data returned until you either get to the end or you find the cluster's IP in the list.
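In Java terms, a rough sketch of that gethostbyname()-style check could look like the following; the shared cluster name ha-cluster.example.com is a placeholder for whatever name Heartbeat's external IP resolves from.

import java.net.InetAddress;
import java.net.InterfaceAddress;
import java.net.NetworkInterface;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: the box is "active" if one of the addresses the cluster's shared
// name resolves to is currently configured on a local interface.
public class AmIActive {
    public static void main(String[] args) throws Exception {
        Set<String> localAddrs = new HashSet<>();
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InterfaceAddress ia : nic.getInterfaceAddresses()) {
                localAddrs.add(ia.getAddress().getHostAddress());
            }
        }
        boolean active = false;
        for (InetAddress addr : InetAddress.getAllByName("ha-cluster.example.com")) {
            if (localAddrs.contains(addr.getHostAddress())) {
                active = true;
                break;
            }
        }
        System.out.println(active ? "active node" : "standby node");
    }
}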
Without hard-coding...? I assume you mean some native heartbeat query; I'm not sure. However, you could use ifconfig: HA creates a virtual interface on whatever interface it is configured to run on. For instance, if HA was configured on eth0 then it would create a virtual interface of eth0:0, but only on the active node.
Therefore you could do a simple query of the ifconfig output to determine whether the server was the active node or not, for example if eth0 was the configured interface:
ACTIVE_NODE=`ifconfig | grep -c 'eth0:0'`
That will set the $ACTIVE_NODE variable to 1 (for active) and 0 (if standby). Hope that may help.
http://www.of-networks.co.uk