namenode recovery hadoop 2.x (active and passive namenode) - hadoop2

In Hadoop 2.x we have the concept of an active NameNode and a passive (standby) NameNode, and both use a shared location for the fsimage and edit logs.
What happens when the cluster loses access to that shared location?
I mean, is there also a concept of taking a backup of the shared location in Hadoop 2.x?
Please suggest.

Related

Can we migrate local beanstalkd to GCP Compute instance?

I have a local system with a beanstalkd application running. Now I want to migrate it to a GCP Compute Engine instance.
I have installed beanstalkd on the Compute Engine instance, so how can I migrate my data to it?
The steps are straightforward (a sketch of the transfer commands follows the list):
make sure the binlog is activated
locate the binlog file
stop the beanstalkd instance for a fully graceful shutdown, so the binlog file is flushed
upload the binlog to Cloud Storage (either using the Console or the gcloud/gsutil utilities)
on the Compute Engine instance, make sure you have the Storage permission in the edit section
download the binlog file to the machine itself using gcloud/gsutil commands
start the beanstalkd service and voilà, your persisted messages are there
you can set up the Beanstalkd Console Admin on the same machine or on other machines
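A minimal sketch of the transfer, assuming the binlog lives in /var/lib/beanstalkd; the bucket name and paths are placeholders:

    # On the local machine: upload the flushed binlog files to Cloud Storage
    gsutil cp /var/lib/beanstalkd/binlog.* gs://my-backup-bucket/beanstalkd/

    # On the Compute Engine instance: pull the binlog files down and restart beanstalkd
    gsutil cp gs://my-backup-bucket/beanstalkd/binlog.* /var/lib/beanstalkd/
    sudo systemctl start beanstalkd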

GCE snapshot vs VMware snapshot

I am new to GCE and now looking for a way to back up our GCE VMs.
As per Google documentation, it seems that Google recommends performing the VM backup by creating snapshots. However, in VMware, using snapshots as a backup method is not recommended as the delta disk will grow and the system may be unstable.
I wonder if the way GCE handles snapshots is different from the way VMware does, so that in GCE snapshots can be used as a backup method?
Thanks
Yes, you can use snapshots as backups. You can take a backup of a persistent disk by creating snapshots even while it is attached to a running instance. Snapshots are global resources, so you can use them to restore data to a new disk or instance within the same project. You can also share snapshots across projects.
You can refer to the public documentation of GCP on persistent disk snapshots: https://cloud.google.com/compute/docs/disks/create-snapshots
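For example, a snapshot can be created and later restored to a new disk from the command line (disk, zone and snapshot names below are placeholders):

    # Create a snapshot of an existing persistent disk
    gcloud compute disks snapshot my-disk --zone=us-central1-a --snapshot-names=my-disk-backup

    # Restore the snapshot to a brand new disk
    gcloud compute disks create my-disk-restored --source-snapshot=my-disk-backup --zone=us-central1-a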
In GCP you can also use instance templates to create VMs, and you can use them as a backup of the configuration for creating VMs whenever you need. Instance templates define the machine type, boot disk image or container image, labels, and other instance properties. You can then use an instance template to create a MIG or to create individual VMs. Instance templates are a convenient way to save a VM instance's configuration so you can use it later to create VMs or groups of VMs.
You can refer to the public documentation of GCP on Instance templates : https://cloud.google.com/compute/docs/instance-templates
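A rough sketch of that flow, with illustrative names:

    # Save the configuration as a template
    gcloud compute instance-templates create my-template \
        --machine-type=e2-medium --image-family=debian-12 --image-project=debian-cloud

    # Later, create a VM from the template
    gcloud compute instances create my-restored-vm --zone=us-central1-a \
        --source-instance-template=my-template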
Note on backups and persistent disk snapshots
"If you create a snapshot of your persistent disk while your application is running, the snapshot might not capture pending writes that are in transit from memory to disk." -- Best practices for persistent disk snapshots. This could lead to snapshots which aren't actually a good backup, because the disk is in an unexpected state. The linked doc gives recommendations on pausing applications before snapshotting to ensure consistent backups.
Another option (for applications which support it) is to do application-level backups using their own tools. Databases, for example, usually have their own backup tooling which understands application state and creates reliable backups.
Are GCE persistent disk snapshots different than VMWare snapshots?
Yes, they are different. GCE PD doesn't stack deltas of the changes over time. The two systems use the word snapshot, but the underlying mechanism is different.
In VMware, if you take multiple snapshots, each snapshot freezes the current disk state and future snapshots are based on the differences (deltas) from the previous ones, creating a chain of differences in which the live disk is the most recent state. A snapshot is a point in time in the evolution of the disk data. This allows directly rolling back the disk, but if you store a nightly snapshot as a backup, you can't just delete the old snapshots, because new ones rely on the old ones. (There are options to compact them, though.)
In Compute Engine, snapshots are separate objects and each snapshot is based on the state of the original disk at that time. A snapshot is a logical copy of the disk at a certain time. If you take nightly backups, you can delete old ones because they are independent from each other.
Also, Compute Engine snapshots are only snapshots of the disk, not the machine configuration. Machine Images capture both machine configuration and the disk state.
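If you want to capture the machine configuration along with the disks, a machine image can be created from a running instance; a sketch with placeholder names:

    # A disk snapshot captures only a disk; a machine image captures the whole instance
    gcloud compute machine-images create my-vm-image \
        --source-instance=my-vm --source-instance-zone=us-central1-a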

Is it possible to run the IoT-Agent for Ultralight 2.0 without a MongoDB link (holding device data in memory)?

While configuring the IoT-Agent for Ultralight 2.0 there is a possibility to set the Docker variable IOTA_REGISTRY_TYPE, which controls whether to hold IoT device info in memory or in a database (MongoDB by default). This is the documentation that I'm referencing.
Firstly, I would like to have it set to memory: what would that imply?
Could the data be preserved only in some allocated part of memory within the Docker environment? Could I omit further variables within the configuration file, like IOTA_MONGO_HOST (the hostname of MongoDB, used for holding device information)?
The architecture for my system has a Raspberry Pi running the IoT Agent and a VM running the Orion Context Broker and MongoDB. Both are reachable because they see each other on the LAN. Is it necessary for MongoDB to be the same database for the IoT Agent and the Orion Context Broker if they are linked?
Is it possible to run the IoT Agent with a memory-only type of device information persistence (instead of the database type)? Will it have any effect on the whole infrastructure running, besides the obvious lack of device data persistence?
Firstly, I would like to have it set to memory: what would that imply?
There would be no need for a MongoDB database attached to the IoT Agent, but there would be no persistence of provisioned devices in the event of disaster recovery.
Could the data be preserved only in some allocated part of memory within the Docker environment?
No
Could I omit further variables within the configuration file, like IOTA_MONGO_HOST (the hostname of MongoDB, used for holding device information)?
The Docker ENV parameters are merely overrides of the values found in the config.js within the enabler itself, so all of the ENV variables can be omitted if you are using the defaults.
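For example, a minimal docker run sketch with an in-memory registry; the image name and the Context Broker variables are assumptions here, and the IOTA_MONGO_* variables are simply left out:

    # Run the Ultralight IoT Agent with device info held in memory (no MongoDB)
    docker run -d --name iot-agent \
        -e IOTA_CB_HOST=orion \
        -e IOTA_CB_PORT=1026 \
        -e IOTA_NORTH_PORT=4041 \
        -e IOTA_REGISTRY_TYPE=memory \
        fiware/iotagent-ul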
Is it necessary for MongoDB to be the same database for the IoT Agent and the Orion Context Broker if they are linked?
The IoT Agent and Orion can run entirely separately and usually would use separate MongoDB instances. At least this would be the case in a properly architected production environment.
The Step-by-Step Tutorials are lumping everything together on one Docker engine for simplicity. A proper architecture has been sacrificed to keep the narrative focused on the learning goals. You don't need two MongoDB instances to handle fewer than 20 dummy devices.
When deploying to a production environment, try looking at the SmartSDK Recipes in order to scale up to a proper architecture:
see: https://smartsdk.github.io/smartsdk-recipes/
Is it possible to run the IoT Agent with a memory-only type of device information persistence (instead of the database type)? Will it have any effect on the whole infrastructure running, besides the obvious lack of device data persistence?
I haven't checked this, but there may be a slight difference in performance, since memory access should be slightly faster. The trade-off is that you will lose the provisioned state of all devices if a failure occurs. If you need to invest in disaster recovery then MongoDB is the way to go; periodically back up your database so you can always return to a last-known-good state.
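If you do go with MongoDB, a periodic dump is one simple way to keep that last-known-good copy (host, database name and output path below are placeholders):

    # Dump the IoT Agent device database to a dated backup directory
    mongodump --host localhost --port 27017 --db iotagentul --out /backups/iotagent-$(date +%F)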

dataproc cluster on google cloud

My understanding is that the benefit of running a Dataproc cluster, instead of setting up your own Compute Engine cluster, is that it takes care of installing the storage connector (and other connectors). What else does it do for you?
The most significant feature of Dataproc beyond a DIY cluster is the ability to submit Jobs (Hadoop & Spark jars, Hive queries etc.) via an API, WebUI and CLI without configuring tricky network firewalls and exposing YARN to the world.
Cloud Dataproc also takes care of a lot of configuration and initialization, such as setting up a shared Hive Metastore for Hive and Spark, and it allows specifying Hadoop, Spark, etc. properties at boot time.
It boots a cluster in ~90s, which in my experience is faster than most cluster setups. This allows you to tear down the cluster when you no longer need it and not have to wait tens of minutes to bring a new one up.
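As a rough sketch of that workflow from the CLI (cluster name, region and the example jar are placeholders):

    # Create a cluster, submit a Spark job, then tear the cluster down
    gcloud dataproc clusters create my-cluster --region=us-central1

    gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

    gcloud dataproc clusters delete my-cluster --region=us-central1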
I'd encourage you to look at a more comprehensive list of features.

Attaching a disk to multiple GCE instances with one write access

I'm thinking about upgrading my company's integration server by putting the repos on a separate disk that would be shared with a backup server. Like so:
[Main Integration Server] ---R/W--- [Repo Vdisk] ---R/O--- [Backup Integration Server]
My problem is that according to the GCE docs, if I attach the same Vdisk to more than one instance, all instances must access the disk in read-only mode. What I'm looking for is to have one instance access it in read-write mode, and one in read-only mode.
Is this at all possible without powering up a third instance to act as a sort of "storage server"?
As you quoted from the docs and as mentioned in my earlier answer, if you attach a single persistent disk to multiple instances, they must all mount it in read-only mode.
Since you're looking for a fully-managed storage alternative so that you don't have to run and manage another VM yourself, consider using Google Cloud Storage and mounting your bucket with gcsfuse, which will make it look like a regular mounted filesystem.
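A minimal sketch, assuming gcsfuse is installed; the bucket name and mount point are placeholders:

    # Mount the bucket read-write on the main integration server
    mkdir -p /mnt/repos
    gcsfuse my-repo-bucket /mnt/repos

    # Mount the same bucket read-only on the backup server
    gcsfuse -o ro my-repo-bucket /mnt/repos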