Couchbase stuck at re-balancing 0 nodes - couchbase

I have a Couchbase cluster of 3 nodes. Two of them went down briefly and came back up, after which I started a rebalance from the Couchbase console.
However, the console has been showing "Rebalancing 0 nodes" for the past 28 hours. I tried to stop the rebalance, but it's stuck with the same message.
When I used the command-line tool to stop the rebalance, I got the following message:
$ /opt/couchbase/bin/couchbase-cli rebalance-stop -c 127.0.0.1:8091 -u my-admin -p my-password
SUCCESS: rebalance cluster stopped
Yet it's not actually stopped, and the popover in the console is still there. Is there any way I can fix it? I have already tried restarting both servers that are stuck in pending.
EDIT:
I ended up pushing the data to a different cluster using XDCR and then shutting down the entire cluster (even restarting all nodes didn't work). Some data was lost.
I'm keeping this open in case anyone has a better solution for this situation.

Looks like this is a known issue with Couchbase from at least version 2.5.1 through version 4.0.0:
https://forums.couchbase.com/t/rebalance-stuck-at-0-and-does-not-cancel/6568
Looks like you need to know how their Erlang setup works and how to restart ns_server from there.
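A workaround commonly cited for this symptom is to ask the orchestrator to abort the rebalance through ns_server's undocumented `/diag/eval` endpoint. This is an unsupported sketch, not an official procedure; the host and credentials are placeholders from the question, so adapt them before trying it:

```shell
# Unsupported sketch: POST an Erlang expression to ns_server's /diag/eval
# endpoint to force the orchestrator to abort the stuck rebalance.
# Substitute your own admin credentials and host.
curl -u my-admin:my-password -X POST \
  http://127.0.0.1:8091/diag/eval \
  -d 'ns_orchestrator:stop_rebalance().'
```

If the endpoint accepts the request, the console popover should clear within a few seconds; if not, the deeper ns_server restart mentioned above may still be needed.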

Have you tried to remove the node from the cluster and then re-add it to the cluster?

Related

Google compute engine, instance dead? How to reach?

I have a small instance running in GCE. I had some trouble with MongoDB, so after a few tries I decided to reset the instance. But it didn't seem to come back online, so I stopped the instance and restarted it.
It is a Bitnami MEAN stack which starts Apache and the other services at startup.
But I can't reach the instance! No SCP, no SSH, no web service running. When I try to connect via SSH (in GCE) it times out; I can't make a connection on port 22. The information panel says 'The instance is booting up and sshd is not running yet', which is possible of course... But I can't reach the instance in any possible manner, not even after an hour's wait :) Not sure what's happening if I can't connect to it somehow :(
There is some activity in the console: some CPU usage (mostly 0%), some incoming traffic but no outgoing...
I hope someone can give me a hint here!
Update 1
After the helpful tip from Serhii, I found this in the logs:
Booting from Hard Disk 0...
[ 0.872447] piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: Inodes that were part of a corrupted orphan linked list found.
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
fsck exited with status code 4
The root filesystem on /dev/sda1 requires a manual fsck
Update 2...
So, I need to fsck the drive...
I created a snapshot, made a new disk from that snapshot, and added the new disk as an extra disk to another instance. Now that instance won't boot, with the same problem; removing the extra disk fixed it again. So adding the disk makes it crash even though it isn't the boot disk?
First, have a look at Compute Engine -> VM instances -> NAME_OF_YOUR_VM -> Logs -> Serial port 1 (console) and try to find errors and warnings that could be connected to a lack of free space or SSH. It would be helpful if you updated your post with this information. If your instance ran out of free space, follow these instructions.
You can try to connect to your VM via Serial console by following this guide, but keep in mind that:
The interactive serial console does not support IP-based access
restrictions such as IP whitelists. If you enable the interactive
serial console on an instance, clients can attempt to connect to that
instance from any IP address.
You can find more details in the documentation.
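The serial-console steps above can be sketched with gcloud. The instance name and zone here are placeholders, so substitute your own:

```shell
# Enable the interactive serial console on the instance
# (my-instance and us-central1-a are placeholders):
gcloud compute instances add-metadata my-instance \
  --zone us-central1-a \
  --metadata serial-port-enable=TRUE

# Then connect to serial port 1 interactively:
gcloud compute connect-to-serial-port my-instance --zone us-central1-a
```

From the serial console you can log in even when sshd is down, which is useful for running fsck or inspecting /etc/fstab.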
Have a look at the Troubleshooting SSH guide and Known issues for SSH in browser. In addition, Google provides a troubleshooting script for Compute Engine to identify issues with SSH login/accessibility of your Linux based instance.
If you still have a problem try to use your disk on a new instance.
EDIT: It looks like your test VM is trying to boot from the disk that you created from the snapshot. Try to follow this guide.
If you still have a problem, you can try to recreate the boot disk from a snapshot to resize it.
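The rescue-disk approach can be sketched as follows. All names here (broken-vm, rescue-vm, recovery-disk, /dev/sdb1) are placeholders, and the exact device name on the rescue VM may differ, so check with lsblk first:

```shell
# Move the damaged disk to a healthy instance as a secondary (non-boot) disk:
gcloud compute instances detach-disk broken-vm --disk recovery-disk
gcloud compute instances attach-disk rescue-vm --disk recovery-disk

# On rescue-vm, identify the partition (often /dev/sdb1; verify with lsblk)
# and repair it WITHOUT mounting it first:
sudo fsck.ext4 -fy /dev/sdb1

# Then detach it from rescue-vm and re-attach it to the original instance
# as its boot disk.
```

If the rescue instance itself fails to boot with the disk attached, make sure nothing in its /etc/fstab tries to auto-mount the damaged filesystem at startup.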

Google Cloud SQL instance always in Maintenance status & Binary logs issue

I have several Google Cloud SQL MySQL 2nd Gen 5.7 instances with failover replication. Recently I noticed that one of the instances was overloaded: its storage filled up with binlogs, and the old binlogs were not being deleted for some reason.
I tried restarting this instance, but it hasn't started since 17 March.
Normal process with binlogs on other server:
Problem server: binlogs are not being cleared, the server won't start, and it is always under maintenance in the gcloud console.
I also created another server with the same configuration, and its binlogs are never cleared either. I already have 5326 binlogs there, while on the normal server I have 1273 and they are cleared each day.
What I tried with the problem server:
1 - Delete it from the Google Cloud Platform frontend. Response: The instance id is currently unavailable.
2 - Restart it with the gcloud command. Response: ERROR: (gcloud.sql.instances.restart) HTTPError 409: The instance or operation is not in an appropriate state to handle the request. I get the same response for any other command I send with gcloud.
I also tried to solve the binlog problem by configuring the expire_logs_days option, but it seems this option is not supported by Google Cloud SQL instances.
After 3 days of digging I found a solution. Binlogs should be cleared automatically once they are 7 days old; on the 8th day the instance should clear them. They still haven't been deleted for me and the storage is still climbing, but I trust it will clear shortly (today, I guess).
As I said, the SQL instance is always in maintenance and can't be deleted from the gcloud command line or the frontend. But here is the interesting part: I could still connect to the instance with the mysql command, like mysql -u root -p -h 123.123.123.123. So I just connected to the instance and deleted an unused database (or you can first use mysqldump to save the current live database). In the MySQL logs (I'm using Stackdriver for this) I got a lot of messages like this: 2018-03-25T09:28:06.033206Z 25 [ERROR] Disk is full writing '/mysql/binlog/mysql-bin.034311' (Errcode: -255699248 - No space left on device). Waiting for someone to free space.... Let me be this "someone".
When I deleted the database, the instance restarted and came back up. Voila, now we have a live instance. Now we can delete it, restore a database on it, or change its storage.
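The workaround above can be sketched as two commands. The IP comes from the question; the database names (important_db, unused_db) are hypothetical placeholders:

```shell
# First dump any database you still need (important_db is a placeholder):
mysqldump -u root -p -h 123.123.123.123 important_db > important_db.sql

# Then drop an unused database to free disk space so the instance can
# write its binlog again and finish restarting:
mysql -u root -p -h 123.123.123.123 \
  -e "DROP DATABASE unused_db;"
```

The point is simply to free enough space that MySQL can get past the "Waiting for someone to free space" state.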

Can't make kubernetes example for wordpress & mysql with persistent data work

I followed this Kubernetes example to create WordPress and MySQL with persistent data.
I followed everything in the tutorial, from creation of the disk to deployment, and on the first try the deletion as well.
1st try
https://s3-ap-southeast-2.amazonaws.com/dorward/2017/04/git-cmd_2017-04-03_08-25-33.png
Problem: the persistent volumes do not bind to the persistent volume claims. Both the pod and the volume claim remain in Pending status, and the volume status remains Released as well.
I had to delete everything as described in the example and try again. This time I mounted the created volumes on an instance in the cluster, formatted the disks with an ext4 filesystem, then unmounted them.
2nd try
https://s3-ap-southeast-2.amazonaws.com/dorward/2017/04/git-cmd_2017-04-03_08-26-21.png
Problem: after formatting the volumes, they are now bound to the claims, yay! Unfortunately the mysql pod doesn't run, with status CrashLoopBackOff. Eventually the wordpress pod crashed as well.
https://s3-ap-southeast-2.amazonaws.com/dorward/2017/04/git-cmd_2017-04-03_08-27-22.png
Did anyone else experience this? I'm wondering if I did something wrong, or if something has changed between the write-up of the example and now that made it break. How do I go about fixing it?
Any help is appreciated.
Get logs for pods:
kubectl logs pod-name
If the log indicates the pods are not even starting (CrashLoopBackOff), investigate the events in k8s:
kubectl get events
The event log indicates the node is running out of memory (OOM):
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
1m 7d 1555 gke-hostgeniuscom-au-default-pool-xxxh Node Warning SystemOOM {kubelet gke-hostgeniuscom-au-default-pool-xxxxxf-qmjh} System OOM encountered
Trying a larger instance size should solve the issue.
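Besides moving to a larger node, you can cap the pods' memory so the scheduler only places them where enough memory is available. A sketch, assuming the deployment names from the example (wordpress, wordpress-mysql); the request/limit values are guesses to tune for your workload:

```shell
# Set memory requests/limits on the mysql deployment so the scheduler
# accounts for its usage (names and sizes are assumptions):
kubectl set resources deployment wordpress-mysql \
  --requests=memory=256Mi --limits=memory=512Mi

# And on the wordpress deployment:
kubectl set resources deployment wordpress \
  --requests=memory=128Mi --limits=memory=256Mi
```

With requests set, pods that don't fit stay Pending with a clear scheduling event instead of starting and getting OOM-killed.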

Issue with container restart , Galera MariaDB stack on rancher

Hi, I am trying to create a WordPress application using docker-compose, and I use the Galera MariaDB catalog entry from Rancher.
I can get the whole setup working fine. I use external links and connect to the load balancer with some environment variables, like this:
external_links:
r-galera_galera-lb_1:mysql
I can see the tables being replicated in the cluster; however, if I reboot the machine, I fail to launch the application even after the stack becomes active again.
I get the error like this:
> wordpress-docker-php-fpm | MySQL "CREATE DATABASE" Error: WSREP has not yet prepared node for application use
> wordpress-docker-php-fpm exited with code 1
When I remove the whole Galera stack and make a new one, I get my WordPress setup working again.
I had to come to this forum with this issue since I couldn't contact any maintainer of the catalog (there isn't any contact info). Can someone help in this regard?
Hello Syed Alam Abbas,
The issue with your approach is that the cluster is not properly shut down and restarted. If you reboot your machine, the cluster's nodes go out of sync, and each stores the latest state it was in. When you restart the machine and everything comes back online, you have an unsynced cluster. You can follow this guide to recover your cluster.
The guide is pretty straightforward:
Check the latest committed state of all your nodes (wsrep_last_committed) with SHOW STATUS LIKE 'wsrep_%';
Promote the node that has the most up-to-date data to be the primary.
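The steps above can be sketched as follows, assuming a systemd-based MariaDB 10.1+ Galera install (paths and the bootstrap helper may differ on other setups):

```shell
# On each node, check the last committed transaction; the node with the
# highest wsrep_last_committed has the most up-to-date data:
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_last_committed';"

# On that node only, mark it safe to bootstrap (newer Galera versions
# record this flag in grastate.dat):
sudo sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' \
  /var/lib/mysql/grastate.dat

# Bootstrap a new primary component from it:
sudo galera_new_cluster

# Start MariaDB normally on the remaining nodes; they resync (IST/SST)
# from the bootstrapped node:
sudo systemctl start mariadb
```

Once all nodes report wsrep_ready = ON, the "WSREP has not yet prepared node for application use" error should go away.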

Tomcat Freezes just one application

I have an Ubuntu LAMJ server running Tomcat 6.
One of my JSP applications freezes every couple of days and I am having trouble figuring out why. I have to restart Tomcat to get that one app going again, as it won't come back on its own. I am getting nothing in my own log4j logs for that app, and I can't see anything in catalina.out either.
This application shares a javax.sql.DataSource resource with another, via a Context element in the server.xml file. I don't think this is the cause of the problem, but I may as well mention it.
Could anyone point me in the right direction to find the cause of this intermittent issue?
thanks in advance,
Christy
Get a Thread dump of the running server
There are two options
Use VisualVM
In your $JAVA_HOME/bin folder there will be a file called jvisualvm. Run this and connect to your Tomcat server. Click the Threads tab and then "Thread Dump".
Manually from the Command Line
Open up a command line and find the process ID for your Tomcat:
ps -ef | grep java
Once you identify the process ID for the running Tomcat instance:
kill -3 <pid>
Replace the process ID here. This will send the thread dump to stdout for your Tomcat, most likely the catalina.out file.
edit - As per Mark's comments below:
It is normal to take 3 thread dumps ~10s apart and compare them. It
makes it much easier to see which threads are 'stuck' and which ones
are moving
Once you have the thread dumps, you can analyze them for stuck threads. The problem may not be stuck threads, but at least you can see what is going on inside the server to investigate further.
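The "three dumps ~10s apart" advice can be sketched as a small loop. This assumes the JDK's jstack tool is on the PATH (it writes the dump to a file instead of catalina.out, which is easier to compare); 12345 is a placeholder for your Tomcat PID:

```shell
# Take three thread dumps roughly 10 seconds apart so stuck threads
# (same stack in every dump) stand out from moving ones.
PID=12345   # placeholder: substitute the PID found via ps -ef | grep java
for i in 1 2 3; do
  jstack "$PID" > "threaddump_${i}.txt"
  sleep 10
done
```

Diffing the three files (or loading them into a thread-dump analyzer) quickly shows which threads never change state.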