In a setup where NameNode High Availability is not configured, how does the Secondary NameNode handle checkpointing operations?
If High Availability is not configured, then the Secondary NameNode is there by default, as it was in Hadoop 1.
If you're not familiar with the Hadoop 1 concepts of the Secondary NameNode and checkpointing, here is a short description; for details you may want to refer to the Apache docs.
Checkpointing works as follows:
The NameNode generates an edit log containing all the changes made in HDFS (file permissions, file names, ACL permissions, replication factor, etc.). These changes are stored temporarily in the edit logs and are permanently merged into the fsimage when a checkpoint occurs. By default, checkpointing is done every 60 minutes.
The edit logs and fsimage generated by the NameNode are stored on the LFS (local file system), and one copy of that fsimage is sent to the Secondary NameNode. Now, why is it called a backup node? Because if the NameNode goes down or loses its metadata, it can contact the SNN (Secondary NameNode) for the last saved fsimage and restore the metadata.
That's the basic idea behind the NameNode and Secondary NameNode.
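For reference, the checkpoint interval mentioned above is configurable. A minimal hdfs-site.xml sketch using the standard Hadoop 2.x property names (the values shown are the defaults):

```xml
<!-- Checkpoint tuning (values shown are the defaults). -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- seconds between checkpoints: the "60 mins" above -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- force an earlier checkpoint after this many uncheckpointed transactions -->
</property>
```

Whichever of the two limits is reached first triggers the checkpoint.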
My understanding is that Zookeeper is often used to solve the problem of "keeping track of which node plays a particular role" in a distributed system (e.g. master node in a DB or in a MapReduce cluster, etc).
For simplicity, say we have a DB with one master and multiple replicas and the current master node in the DB goes down. In this scenario, one would, in principle, make one of the replica nodes a new master node. At this point my understanding is:
If we didn't have Zookeeper
The application servers may not know that we have a new master node, so they would not know where to send writes unless we have some custom logic on the app server itself to detect / correct this problem.
If we have Zookeeper
Zookeeper would somehow detect this failure, and update the value for the corresponding master key. Moreover, application servers can (optionally?) register hooks in Zookeeper, so Zookeeper can notify them of this failure, so that the app servers can update (e.g. in memory), which DB node is the new master.
My questions are:
How does ZooKeeper know which node to make master? Is ZooKeeper responsible for this choice?
How is this information propagated to nodes that need to interact with ZooKeeper? E.g., if one of the ZooKeeper nodes goes down, how would the application servers know which ZooKeeper node to hit in this scenario? Does ZooKeeper manage this differently from competing solutions like e.g. etcd?
The answer to both 1 and 2 is the leader election process, which briefly works in the following way:
When a process starts in a cluster managed by ZK, the cluster enters an election state. If a leader already exists, the established hierarchy stands and the existing leader is simply verified. If there is no leader (say the master is down), ZK has the znodes use sequence flags to look for a new leader. Each node talks to its peers and sends a message containing the node's identifier (sid) and the most recent transaction it executed (zxid). These messages are called votes. When a node receives a vote, it either keeps or discards it depending on the zxid: if the received zxid is newer than its own, it keeps the vote; if older, it discards it. If there is a tie in zxids, the vote with the highest sid wins! Eventually all nodes hold the same vote, which determines the new leader by its sid. That is how ZK elects a new leader node!
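The zxid/sid comparison above can be sketched as a toy simulation. The function names are mine, and this is illustrative only: real ZooKeeper (FastLeaderElection) exchanges votes over the network and also tracks election epochs, which are omitted here.

```python
# Toy simulation of ZooKeeper's leader-election vote comparison.

def better_vote(mine, received):
    """Return the winning vote. A vote is a (zxid, sid) tuple:
    the higher zxid (most recent transaction) wins; on a tie, the higher sid wins."""
    if received[0] != mine[0]:
        return received if received[0] > mine[0] else mine
    return received if received[1] > mine[1] else mine

def elect(votes):
    """Each node keeps the best vote it has seen; with a full exchange,
    every node converges on the same winner."""
    winner = votes[0]
    for v in votes[1:]:
        winner = better_vote(winner, v)
    return winner

# Three nodes voting for themselves: (zxid, sid).
# Nodes 2 and 3 have both seen the newest transaction (zxid 7).
votes = [(5, 1), (7, 3), (7, 2)]
print(elect(votes))  # (7, 3): newest zxid wins, highest sid breaks the tie
```

Because the comparison is deterministic, it does not matter in which order the votes arrive; all nodes settle on the same (zxid, sid) pair.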
I am planning to store data in a Couchbase cluster.
I would like to know what will happen if my Couchbase cluster goes down in the following scenarios:
[Consider that there were no active transactions happening.]
A node goes down in the cluster. (My assumption is that after the node is fixed and comes back up, it will sync with the other nodes, and the data will be there.) Also, let me know whether there will still be any data loss after it has synced.
The whole cluster went down, was fixed, and was restarted.
Please let me know the data persistence behavior for the above scenarios.
Yes, Couchbase persists data to disk. It writes change operations to append-only files on the Data Service nodes.
Data loss is unlikely for your two scenarios because there are no active transactions.
Data loss can occur if a node fails
while persisting a change to disk or
before completing replication to another node if the bucket supports replicas.
Example: Three Node Cluster with Replication
Consider the case of a three-node Couchbase cluster and a bucket with one replica for each document. That means a single document will have copies stored on two separate nodes; call those the active and the replica copies. Couchbase shards the documents evenly across the nodes.
When a node goes down, about a third of the active and replica copies become unavailable.
A. If a brand new node is added and the cluster rebalanced, the new node will have the same active and replica copies as the old one did. Data loss will occur if replication was incomplete when the node failed.
B. If the node is failed over, then replicas for the active documents on the failed node will become active. Data loss will occur if replication was incomplete when the node failed.
C. If the failed node rejoins the cluster it can reuse its existing data so the only data loss would be due to a failure to write changes to disk.
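The active/replica placement described above can be sketched roughly like this. The node names and mapping are illustrative assumptions; real Couchbase hashes keys into 1024 vBuckets (via CRC32) and assigns vBuckets to nodes through a cluster map, rather than using the simple modulo below.

```python
# Sketch of Couchbase-style placement: each document gets an active copy
# and a replica copy on two *different* nodes.
import zlib

NODES = ["node-a", "node-b", "node-c"]   # made-up node names
NUM_VBUCKETS = 1024                      # Couchbase's fixed vBucket count

def placement(key, nodes=NODES):
    """Return (active_node, replica_node) for a document key."""
    vbucket = zlib.crc32(key.encode()) % NUM_VBUCKETS
    active = nodes[vbucket % len(nodes)]
    replica = nodes[(vbucket + 1) % len(nodes)]  # always a different node
    return active, replica

for key in ("user::1", "user::2", "order::42"):
    active, replica = placement(key)
    print(f"{key}: active on {active}, replica on {replica}")
```

With three nodes, roughly a third of the active copies (and a third of the replicas) live on each node, which is why losing one node makes about a third of the copies unavailable until failover or rebalance.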
When the cluster goes down, data loss may occur if there is a disk failure.
What is the need of Journal node
Why we configure three journal nodes in high availability.
Is it only for replication?
The role of the JournalNodes is to keep both NameNodes in sync and to avoid an HDFS split-brain scenario by allowing only the active NameNode to write to the journals.
From the Apache Hadoop documentation:
Prior to Hadoop 2.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine was unavailable, the cluster on the whole would be unavailable until the NameNode was either restarted or started on a separate machine. In a classic HA cluster, two separate machines are configured as NameNodes. At any point, one of the NameNodes will be in Active state and the other will be in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover.
In order for the Standby node to keep its state coordinated with the Active node, both nodes communicate with a group of separate daemons called ‘JournalNodes’ (JNs). When any namespace modification is performed by the Active node, it logs a record of the changes made, in the JournalNodes. The Standby node is capable of reading the amended information from the JNs, and is regularly monitoring them for changes. As the Standby Node sees the changes, it then applies them to its own namespace. In case of a failover, the Standby will make sure that it has read all the changes from the JournalNodes before changing its state to ‘Active state’. This guarantees that the namespace state is fully synched before a failover occurs.
JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally.
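The "(N - 1) / 2" rule quoted above is easy to check numerically; a quick sketch:

```python
# Quorum math behind "at least 3 JournalNodes": an edit must be written
# to a majority of JNs, so N JournalNodes tolerate (N - 1) // 2 failures.

def tolerated_failures(n_journal_nodes):
    return (n_journal_nodes - 1) // 2

for n in (3, 4, 5, 7):
    majority = n // 2 + 1
    print(f"{n} JNs: majority = {majority}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that 4 JNs tolerate no more failures than 3 (both tolerate exactly one), which is why the docs recommend an odd number.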
Here are also some good external links about JournalNodes:
https://www.edureka.co/blog/namenode-high-availability-with-quorum-journal-manager-qjm/
https://community.hortonworks.com/articles/27225/how-qjm-works-in-namenode-ha.html
I'm trying to pin down something in the Gemfire documentation around region back-ups.
http://gemfire.docs.pivotal.io/geode/reference/topics/cache_xml.html#region
Scroll down to the SCOPE attribute...
Using the SCOPE attribute on REGION-ATTRIBUTES, I'm assuming that SCOPE="DISTRIBUTED-ACK" would mean a SYNC back-up operation on a REGION and that SCOPE="DISTRIBUTED-NO-ACK" means an ASYNC back-up operation.
The REGION in question is PARTITIONED. I understand that REPLICATED regions default to DISTRIBUTED-ACK.
Would this assumption be correct? I.e., that via configuration Gemfire allows you to configure SYNC or ASYNC back-up operations for REGION entry updates.
Backups actually operate at the level of disk stores and files, not individual regions. The backup operation will create a copy of all of the disk store files, which may contain data for many regions with different scopes. The gfsh backup disk-store command will always wait for the backup to complete. So the region scope doesn't really affect whether the backup command is synchronous or asynchronous.
If you use DISTRIBUTED_NO_ACK scope, it does mean that a put could complete before all members receive the update, so technically there is no guarantee that a put on a NO_ACK region will be part of a backup that happens after the put.
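For completeness, a backup is taken with gfsh's `backup disk-store` command, which copies every member's disk-store files under the given directory and blocks until the backup completes. The locator address and target directory below are made-up examples:

```
gfsh> connect --locator=localhost[10334]
gfsh> backup disk-store --dir=/backups/2019-06-01
```

Note that the command operates cluster-wide on disk stores, not per region, which is why region scope does not change its synchronous behavior.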
I see an apparently random problem about once a month that is doing my head in. Google appears to be changing the naming convention for additional disks (beyond the root disk) and how they are presented under /dev/disk/by-id/ at boot.
The root disk is always available as /dev/disk/by-id/google-persistent-disk-0.
MOST of the time, the single extra disk we mount is presented as /dev/disk/by-id/google-persistent-disk-1.
We didn't choose this name, but we wrote our provisioning scripts to expect this convention.
Every now and then, on rebooting the VM, our startup scripts fail in executing a safe mount:
/usr/share/google/safe_format_and_mount -m "mkfs.ext4 -F" /dev/disk/by-id/google-persistent-disk-1 /mountpoint
They fail because something has changed the name of the disk. It's no longer /dev/disk/by-id/google-persistent-disk-1; it's now /dev/disk/by-id/google-{the name we gave it when we created it}.
Last time, I updated our startup scripts to use this new naming convention, and it switched back an hour later. WTF?
Any clues appreciated. Thanks.
A naming convention beyond your control is not a stable API. You should not write your management tooling to assume this convention will never be changed -- as you can see, it's changing for reasons you have nothing to do with, and it's likely that it will change again. If you need access to the list of disks on the system, you should query it through udev, or you can consider using /dev/disk/by-uuid/ (which will not change, because the UUID is generated at filesystem creation) instead of /dev/disk/by-id/.
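As a concrete sketch of the by-uuid approach: read the filesystem UUID once (e.g. with `blkid /dev/sdb`) and bake that, rather than the by-id name, into your provisioning. The UUID and the helper function below are made-up examples:

```python
# Build an /etc/fstab line keyed on the filesystem UUID, which survives
# device renames (unlike /dev/disk/by-id/google-*). The UUID here is a
# made-up example; read the real one from `blkid` on your VM.

def fstab_entry(uuid, mountpoint, fstype="ext4", options="defaults,nofail"):
    """Return an fstab line that mounts by UUID instead of device name."""
    return f"UUID={uuid} {mountpoint} {fstype} {options} 0 2"

print(fstab_entry("3c5e2f81-9a1b-4c6d-8e2f-1a2b3c4d5e6f", "/mountpoint"))
```

With an entry like this in /etc/fstab, a plain `mount /mountpoint` in your startup script works no matter what name the disk is given under /dev/disk/by-id/.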