Issue with container restart, Galera MariaDB stack on Rancher - containers

Hi, I am trying to create a WordPress application using Docker Compose, and I use the Galera MariaDB catalog entry from Rancher.
I can get the whole setup working fine. I use external links and connect to the load balancer with an environment variable like this:
external_links:
  - r-galera_galera-lb_1:mysql
I can see the tables being replicated in the cluster. However, if I reboot the machine, the application fails to launch even after the stack becomes active again.
I get an error like this:
> wordpress-docker-php-fpm | MySQL "CREATE DATABASE" Error: WSREP has not yet prepared node for application use
> wordpress-docker-php-fpm exited with code 1
When I remove the whole Galera Stack and make a new one I get my wordpress setup working again.
I had to come to this forum because I couldn't contact any maintainer of the catalog (there isn't any contact info). Can someone help in this regard?

Hello Syed Alam Abbas,
The issue with your approach is that the cluster is not properly shut down and started. If you reboot your machine, the cluster's nodes go out of sync, and each one stores the latest state it was in. When the machine restarts and everything is back online, you have an unsynced cluster. You can follow this guide to recover your cluster.
The guide is pretty straightforward:
Check the latest state of all your nodes (wsrep_last_committed) with SHOW STATUS LIKE 'wsrep_%';
Promote the node which has the most-up-to-date data to be the primary.
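The comparison in the two steps above can be sketched in Python. This is only a minimal sketch, assuming you have already read the wsrep_last_committed value from each node by hand; the node names, the numbers, and the helper name pick_bootstrap_node are hypothetical and not part of Galera or Rancher.

```python
def pick_bootstrap_node(last_committed):
    """Return the name of the node with the highest wsrep_last_committed.

    last_committed maps a (hypothetical) node name to the integer value
    reported by SHOW STATUS LIKE 'wsrep_last_committed' on that node.
    """
    if not last_committed:
        raise ValueError("no nodes given")
    # Iterate over sorted names so that ties are resolved deterministically
    # (the alphabetically first name wins).
    return max(sorted(last_committed), key=lambda name: last_committed[name])


# Example values are made up for illustration.
nodes = {"galera_1": 1052, "galera_2": 1060, "galera_3": 1047}
print(pick_bootstrap_node(nodes))
```

The node this returns is the candidate to promote; how the promotion itself is done is covered by the guide and is not shown here.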


Google Compute Engine, instance dead? How to reach?

I have a small instance running in GCE. I had some trouble with MongoDB, so after a few tries I decided to reset the instance. But it didn't seem to come back online, so I stopped the instance and restarted it.
It is a Bitnami MEAN stack which starts Apache and other things at startup.
But I can't reach the instance! No SCP, no SSH, no web service running. When I try to connect via SSH (in GCE) it times out and can't make a connection on port 22. The information says 'The instance is booting up and sshd is not running yet', which is possible of course. But I can't reach the instance in any way, not even after waiting an hour :) I'm not sure what's happening if I can't connect to it somehow :(
There is some activity in the console: some CPU usage, mostly 0%, and some incoming traffic but no outgoing.
I hope someone can give me a hint here!
Update 1
After the helpful tip from Serhii, I found this in the logs:
Booting from Hard Disk 0...
[ 0.872447] piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: Inodes that were part of a corrupted orphan linked list found.
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
fsck exited with status code 4
The root filesystem on /dev/sda1 requires a manual fsck
Update 2
So, I need to fsck the drive.
I created a snapshot, made a new disk from that snapshot, and added the new disk as an extra disk to another instance. Now that instance won't boot, with the same problem. Removing the extra disk fixed it again. So adding the disk makes it crash even though it isn't the boot disk?
First, have a look at Compute Engine -> VM instances -> NAME_OF_YOUR_VM -> Logs -> Serial port 1 (console) and try to find errors and warnings that could be connected to a lack of free space or to SSH. It would be helpful if you updated your post with this information. If your instance has run out of free space, follow these instructions.
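If the serial console output is long, a small script can pull out the suspicious lines. The following is a rough Python sketch, assuming you have copied the console output into a string; the patterns are assumptions chosen to match the log excerpt quoted in this thread, and the helper name find_disk_problems is hypothetical.

```python
import re

# Patterns are assumptions based on the log excerpt in this thread;
# extend the list for other kinds of errors.
PATTERNS = [
    re.compile(r"UNEXPECTED INCONSISTENCY", re.IGNORECASE),
    re.compile(r"requires a manual fsck", re.IGNORECASE),
    re.compile(r"fsck exited with status code \d+", re.IGNORECASE),
    re.compile(r"No space left on device", re.IGNORECASE),
]


def find_disk_problems(console_text):
    """Return the console lines that match any of the patterns above."""
    hits = []
    for line in console_text.splitlines():
        if any(pattern.search(line) for pattern in PATTERNS):
            hits.append(line.strip())
    return hits


sample = """Booting from Hard Disk 0...
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
fsck exited with status code 4
The root filesystem on /dev/sda1 requires a manual fsck"""

for hit in find_disk_problems(sample):
    print(hit)
```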
You can try to connect to your VM via Serial console by following this guide, but keep in mind that:
The interactive serial console does not support IP-based access
restrictions such as IP whitelists. If you enable the interactive
serial console on an instance, clients can attempt to connect to that
instance from any IP address.
You can find more details in the documentation.
Have a look at the Troubleshooting SSH guide and Known issues for SSH in browser. In addition, Google provides a troubleshooting script for Compute Engine to identify issues with SSH login/accessibility of your Linux-based instance.
If you still have a problem, try to use your disk on a new instance.
EDIT: It looks like your test VM is trying to boot from the disk that you created from the snapshot. Try to follow this guide.
If you still have a problem, you can try to recreate the boot disk from a snapshot to resize it.

Google Cloud SQL instance always in Maintenance status & Binary logs issue

I have some Google Cloud SQL MySQL 2nd Gen 5.7 instances with failover replication. Recently I noticed that the storage of one of the instances was filling up with binlogs, and old binlogs were not being deleted for some reason.
I tried to restart this instance, but it has not started since 17 March.
Normal process with binlogs on other server:
Problem server: binlogs are not clearing, and the server won't start and is always under maintenance in the gcloud console.
I also created another server with the same configuration, and its binlogs are never cleared either. I already have 5326 binlogs there, whereas on the normal server I have 1273 binlogs and they are cleared each day.
What I tried with the problem server:
1 - delete it from the Google Cloud Platform frontend. Response: The instance id is currently unavailable.
2 - restart it with the gcloud command. Response: ERROR: (gcloud.sql.instances.restart) HTTPError 409: The instance or operation is not in an appropriate state to handle the request. I get the same response to any other command I send with gcloud.
I also tried to solve the binlog problem by configuring the expire_logs_days option, but it seems this option is not supported by Google Cloud SQL instances.
After 3 days of digging I found a solution. Binlogs should be cleared automatically once 7 days have passed, so on day 8 they should be cleared. They have still not been deleted for me and storage is still climbing, but I trust they will be cleared shortly (today, I guess).
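To check which binlogs should already have been cleared under that rule, something like the following Python sketch could be used. It assumes the 7-day window described above (taken from this post, not verified against Cloud SQL documentation); the file names, timestamps, and the helper name overdue_binlogs are made up for illustration.

```python
from datetime import datetime, timedelta

# Retention window as described in this post; treat it as an assumption.
RETENTION = timedelta(days=7)


def overdue_binlogs(binlogs, now):
    """Return the names of binlogs last modified more than RETENTION ago.

    binlogs is a list of (name, last_modified) pairs.
    """
    return [name for name, modified in binlogs if now - modified > RETENTION]


# Hypothetical file names and timestamps.
now = datetime(2018, 3, 25, 12, 0)
binlogs = [
    ("mysql-bin.034309", datetime(2018, 3, 16, 9, 0)),
    ("mysql-bin.034310", datetime(2018, 3, 20, 9, 0)),
    ("mysql-bin.034311", datetime(2018, 3, 25, 9, 0)),
]
print(overdue_binlogs(binlogs, now))
```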
As I said, the SQL instance is always in maintenance and can't be deleted from the gcloud console command or the frontend. But this is interesting, because I can still connect to the instance with the mysql command, like mysql -u root -p -h 123.123.123.123. So I connected to the instance, picked a database that was unused (or we can use mysqldump to save the current live database first), and deleted it. In the MySQL logs (I'm using Stackdriver for this) I got a lot of messages like this: 2018-03-25T09:28:06.033206Z 25 [ERROR] Disk is full writing '/mysql/binlog/mysql-bin.034311' (Errcode: -255699248 - No space left on device). Waiting for someone to free space.... Let me be this "someone".
When I deleted the database, the instance restarted and came up. Voilà. Now we have a live instance, and we can delete it, restore a database on it, or change its storage.

Can't delete Google Cloud SQL replication master instance

I decided to play around with Google Cloud SQL, so I set up a test SQL instance, loaded it with some data, and then set up replication on it in the Google dev console. I did my testing and found that it all works great: the master/slave setup works as it should and my little POC was a success. So now I want to delete the POC SQL instances, but that's not going so well.
I deleted the replica instance fine (aka the 'slave'), but for some reason the master instance still thinks there is a slave and therefore will not let me delete it. For example, I run the following command in the gcloud shell:
gcloud sql instances delete MY-INSTANCE-NAME
I get the following message:
ERROR: (gcloud.sql.instances.delete) The requested operation is not valid for a replication master instance.
This screenshot also shows that in the google dev console it clearly thinks there are no replicas attached to this instance (because I deleted them) but when I run:
gcloud sql instances describe MY-INSTANCE-NAME
It shows that there is a replica name still attached to the instance.
Any ideas on how to delete this for good? Kinda lame to keep on paying for this when it was just a POC that I want to delete (glad I didn't pick a high memory machine!)
The issue was on Google's side and they fixed it. Here is the sequence of events that led to the issue:
1) Change master's tier
2) Promote replica to master while the master tier change is in progress
Just had the same problem using GCloud. Deleting the failover replica first and then the master instance worked for me.
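The "replicas first, then the master" order can be worked out mechanically when there are several instances. Below is a minimal Python sketch, assuming you have collected each instance's attached replica names from gcloud sql instances describe (I believe the field is called replicaNames, but treat that as an assumption); the instance names and the helper name deletion_order are hypothetical.

```python
def deletion_order(instances):
    """Order instance names so that every replica comes before its master.

    instances maps an instance name to the list of replica names that the
    describe output shows as attached to it.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for replica in instances.get(name, []):
            visit(replica)  # replicas go first
        order.append(name)  # then the instance they are attached to

    for name in sorted(instances):
        visit(name)
    return order


# Hypothetical instance names.
topology = {"poc-master": ["poc-failover"], "poc-failover": []}
print(deletion_order(topology))
```

Each name in the result could then be passed in turn to gcloud sql instances delete, as in the question.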

Couchbase stuck at re-balancing 0 nodes

I have a Couchbase cluster of 3 nodes. 2 of them went down briefly and came back up, after which I started re-balancing from couchbase console.
However, the console has been showing Rebalancing 0 nodes for the past 28 hours. I tried to stop the rebalance, but it's stuck with the same message.
When I used command line tool to stop the rebalance I get the following message:
$ /opt/couchbase/bin/couchbase-cli rebalance-stop -c 127.0.0.1:8091 -u my-admin -p my-password
SUCCESS: rebalance cluster stopped
Yet, it's not actually stopped and the popover in console is still there. Is there any way I can fix it? I have already tried restarting both servers (that are stuck in pending).
EDIT:
I ended up pushing data to a different cluster using XDCR and then shutting down the entire cluster (even restarting all nodes didn't work). Some data was lost.
I'm keeping this open in case anyone has a better solution for such a situation.
Looks like this is a known issue with Couchbase since at least version 2.5.1 through version 4.0.0:
https://forums.couchbase.com/t/rebalance-stuck-at-0-and-does-not-cancel/6568
Looks like you need to know how their erlang setup works and how to restart ns_server from there.
Have you tried to remove the node from the cluster and then re-add it to the cluster?

Percona Toolkit replication filters error

I put the Percona Toolkit onto my DB hosts so I could try to deal with a problem with MySQL going silently out of sync. That is, replication seems fine on all nodes: Slave IO running, Slave SQL running, and 0 seconds behind master.
I have 4 dbs: the first two are set up as master/master, and the other two are slaves. I'm using MariaDB-server-10.0.21 for the MySQL database on each node.
Yet the content of the wiki I run on them seems to go out of sync even with those positive indicators. For instance, you'll create a page, save it, and get the thumbs up from the wiki. Then you reload the page and the content is gone! Then you point the wiki config at each db one at a time and reload the page, until you find the db that saved the changes you made.
Then you dump that db, stop the slaves on each host one at a time, and import that version of the database. It's a real pain!
So I installed the percona toolkit after reading an article on how to solve this problem.
And when I run the pt-table-checksum command I get this error, saying Replication filters are set on these hosts:
[root#db1:~] #pt-table-checksum --replicate=test.checksum --databases=sean --ignore-tables=semaphore localhost
10-17T00:31:11 Replication filters are set on these hosts:
db3
binlog_do_db = jfwiki,jokefire,bacula,mysql
db2
binlog_do_db = jfwiki,jokefire,bacula,mysql
db4
binlog_do_db = jfwiki,jokefire,bacula,mysql
Please read the --check-replication-filters documentation to learn how to solve this problem. at /bin/pt-table-checksum line 9644.
But that EC2 host it claims that it's having trouble contacting equates to my 4th database host. I found out by ssh'ing in as my user to that DNS address. And I have no trouble at all logging into that host on the command line using mysql:
Can someone please explain what this error means and how I can fix the issue? Is there any general advice you can give for MySQL replication falling silently out of sync?
Thanks
Some of the pt tools need to create their own database and have it replicated. Your binlog_do_db prevents the extra db from being replicated, hence preventing that tool from working.
While you have binlog_do_db removed, see which db is being built. Then add it.
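As a quick sanity check, you can compare the database named in --replicate against the binlog_do_db list that the tool printed. This is a simplified Python sketch that only compares names; real filtering with statement-based logging also depends on the default database of each statement, which is not modelled here. The helper name checksum_db_is_filtered is hypothetical.

```python
def checksum_db_is_filtered(replicate_table, binlog_do_db):
    """Return True if the checksum table's database is missing from binlog_do_db.

    replicate_table is the value passed to --replicate, e.g. "test.checksum".
    binlog_do_db is the comma-separated list reported for a host; an empty
    string is treated as "no binlog_do_db filter set".
    """
    database = replicate_table.split(".", 1)[0]
    allowed = [db.strip() for db in binlog_do_db.split(",") if db.strip()]
    return bool(allowed) and database not in allowed


# Values taken from the command and output in the question.
print(checksum_db_is_filtered("test.checksum", "jfwiki,jokefire,bacula,mysql"))
```

With the values from the question this reports that the test database is not in the list, which matches the explanation above.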