Kubernetes Galera: WSREP: failed to open gcomm backend connection: 110

I am trying to set up a 3-replica Galera cluster on Kubernetes. I am using:
https://github.com/kubernetes/kubernetes/tree/master/test/e2e/testing-manifests/statefulset/mysql-galera
The first pod spins up fine, but the second pod gets stuck:
1 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():162
2018-07-21 18:24:40 1 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2018-07-21 18:24:40 1 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1379: Failed to open channel 'mysql' at 'gcomm://mysql-0.galera.mysql.svc.cluster.local,mysql-1.galera.mysql.svc.cluster.local': -110 (Connection timed out)
2018-07-21 18:24:40 1 [ERROR] WSREP: gcs connect failed: Connection timed out
2018-07-21 18:24:40 1 [ERROR] WSREP: wsrep::connect(gcomm://mysql-0.galera.mysql.svc.cluster.local,mysql-1.galera.mysql.svc.cluster.local) failed: 7
2018-07-21 18:24:40 1 [ERROR] Aborting
Do I need to have etcd set up for this cluster to work? Any suggestions would be appreciated.
Thank you!
Kubernetes Info:
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
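For what it's worth, the gcomm:// URL in the error above uses in-cluster DNS names (mysql-0.galera.mysql.svc.cluster.local), so peer discovery in this manifest goes through the headless Service rather than an external store. Before anything else it is worth confirming that those names resolve and that the Galera ports are reachable between pods. A rough diagnostic sketch, assuming the object names implied by the log (StatefulSet mysql, headless Service galera, namespace mysql); nc may not be present in the image:

kubectl -n mysql get statefulset mysql
kubectl -n mysql get svc galera            # must be headless (ClusterIP: None) for per-pod DNS records
kubectl -n mysql get pods -o wide
# Does the peer name from the gcomm:// URL resolve inside the cluster?
kubectl -n mysql run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup mysql-0.galera.mysql.svc.cluster.local
# Galera needs 4567 (group communication), 4444 (SST) and 4568 (IST) open between pods
kubectl -n mysql exec mysql-1 -- nc -z -w 2 mysql-0.galera.mysql.svc.cluster.local 4567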

Related

galera_new_cluster: command not found

I have installed MariaDB 10.2.8 (MariaDB database server) on five Red Hat 6 machines. I edited my.cnf and tried to start the cluster with
sudo service rh-mariadb-mariadb start --wsrep-new-cluster, but startup failed on all the nodes.
If I try the command sudo galera_new_cluster, I get:
galera_new_cluster: command not found
Errors:
2018-08-15 17:44:24 140204929021920 [Warning] WSREP: access
file(/var/opt/(...)lib/mysql//gvwstate.dat) failed(No such file or directory)
2018-08-15 17:44:24 140204929021920 [Note] WSREP: restore pc from disk failed
2018-08-15 17:44:24 140204929021920 [Warning] WSREP: (15ee6eeb, 'tcp://0.0.0.0:4567') address 'tcp://(...):4567' points to own listening address, blacklisting
2018-08-15 17:44:27 140204929021920 [Note] WSREP: (15ee6eeb, 'tcp://0.0.0.0:4567') connection to peer 15ee6eeb with addr tcp://(...):4567 timed out, no messages seen in PT3S
2018-08-15 17:44:27 140204929021920 [Warning] WSREP: no nodes coming from prim view, prim not possible
2018-08-15 17:44:27 140204929021920 [Note] WSREP: view(view_id(NON_PRIM,15ee6eeb,1) memb {
15ee6eeb,0} joined {} left {} partitioned {})
2018-08-15 17:44:57 140204929021920 [Note] WSREP: view((empty))
2018-08-15 17:44:57 140204929021920 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():158
2018-08-15 17:44:57 140204929021920 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2018-08-15 17:44:57 140204929021920 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'cluster' at 'gcomm://(...)': -110 (Connection timed out)
2018-08-15 17:44:57 140204929021920 [ERROR] WSREP: gcs connect failed: Connection timed out
2018-08-15 17:44:57 140204929021920 [ERROR] WSREP: wsrep::connect(gcomm://(...)) failed: 7
2018-08-15 17:44:57 140204929021920 [ERROR] Aborting
I have been spending a day now trying to resolve these errors.
Can anyone help?
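The log shows the node timing out while trying to reach a primary component, which usually means no node has been bootstrapped yet. galera_new_cluster is a systemd wrapper shipped with newer packages, so on a SysV-style setup like this one it typically does not exist and the first node has to be bootstrapped by hand. A rough sketch of the usual sequence, reusing the service command from the question and the grastate.dat approach from later answers in this thread; the datadir path is truncated in the log, so it is left as a <datadir> placeholder:

# On the FIRST node only: bootstrap a new cluster.
sudo service rh-mariadb-mariadb start --wsrep-new-cluster
# If it refuses to bootstrap because of a previous shutdown state, inspect grastate.dat
# and (only if this node really has the most recent data) mark it safe to bootstrap:
grep safe_to_bootstrap <datadir>/grastate.dat
sudo sed -i 's/^safe_to_bootstrap.*/safe_to_bootstrap: 1/' <datadir>/grastate.dat
# Once the first node reports wsrep_cluster_size = 1 and Primary, start the remaining nodes normally:
sudo service rh-mariadb-mariadb start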

mariadb, add 4th galera node failed

I have a three-node setup that has been running perfectly for the past months.
Recently I wanted to add another node in a different location, but I keep getting errors.
I followed this tutorial (the same one I used to set up the cluster a few months ago): https://www.howtoforge.com/tutorial/how-to-install-and-configure-galera-cluster-on-ubuntu-1604/ I did not restart all the nodes from the beginning; I just edited /etc/mysql/conf.d/galera.cnf on the existing three nodes and added the new node's IP there. For the fourth node I set up /etc/mysql/conf.d/galera.cnf like...
[mysqld]
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
# Galera Provider Configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
# Galera Cluster Configuration
wsrep_cluster_name="galera_cluster"
wsrep_cluster_address="gcomm://node1_ip,node2_ip,node3_ip,node4_ip"
# Galera Synchronization Configuration
wsrep_sst_method=rsync
# Galera Node Configuration
wsrep_node_address="xx.xx.xxx.xxx"
wsrep_node_name="Node4"
Somehow I am getting this huge error:
Group state: e3ade7e7-e682-11e7-8d16-be7d28cda90e:36273
Local state: 00000000-0000-0000-0000-000000000000:-1
[Note] WSREP: New cluster view: global state: e3ade7e7-e682-11e7-8d16-be7d28cda90e:36273, view# 122: Primary, number of nodes: 4, my
[Warning] WSREP: Gap in state sequence. Need state transfer.
[Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address 'xxx.node.4.ip' --datadir '/var/lib/mysql/' --parent '22828' ''
rsyncd version 3.1.1 starting, listening on port 4444
[Note] WSREP: Prepared SST request: rsync|xxx.node.4.ip:4444/rsync_sst
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] WSREP: REPL Protocols: 7 (3, 2)
[Note] WSREP: Assign initial position for certification: 36273, protocol version: 3
[Note] WSREP: Service thread queue flushed.
[Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
[Note] WSREP: Member 0.0 (Node4) requested state transfer from '*any*'. Selected 1.0 (Node1)(SYNCED) as donor.
[Note] WSREP: Shifting PRIMARY -> JOINER (TO: 36273)
[Note] WSREP: Requesting state transfer: success, donor: 1
[Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> e3ade7e7-e682-11e7-8d16-be7d28cda90e:36273
[Note] WSREP: (7642cf37, 'tcp://0.0.0.0:4567') connection to peer 7642cf37 with addr tcp://xxx.node.4.ip:4567 timed out, no messages
[Note] WSREP: (7642cf37, 'tcp://0.0.0.0:4567') turning message relay requesting off
mariadb.service: Start operation timed out. Terminating.
Terminated
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 22875
sent 0 bytes received 0 bytes total size 0
WSREP_SST: [INFO] Joiner cleanup done.
[ERROR] WSREP: Process was aborted.
[ERROR] WSREP: Process completed with error: wsrep_sst_rsync --role 'joiner' --address 'xxx.node.4.ip' --datadir '/var/lib/mysql/'
[ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
[ERROR] WSREP: SST failed: 2 (No such file or directory)
[ERROR] Aborting
Error in my_thread_global_end(): 1 threads didn't exit
mariadb.service: Main process exited, code=exited, status=1/FAILURE
Failed to start MariaDB 10.1.33 database server.
P.S. The three older nodes run MariaDB 10.1.29 and the new node runs 10.1.33.
Thanks in advance for any suggestions.
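Two things stand out in the log: the joiner is killed by systemd ("Start operation timed out") in the middle of the rsync SST, and the donor and joiner never exchange any data. Before retrying, it may help to confirm that the new location can actually reach the existing nodes on the Galera ports (a later answer in this thread mentions 4444, 4567 and 4568) and that nothing else is holding the SST port. A rough sketch; new_node_ip is a placeholder:

# From one of the three existing nodes towards the new node:
nc -z -w 3 new_node_ip 4567   # group communication
nc -z -w 3 new_node_ip 4444   # SST (rsyncd on the joiner listens here, per the log above)
nc -z -w 3 new_node_ip 4568   # IST
# On the new node, make sure nothing else holds the SST port:
ss -ltnp | grep 4444
# If the SST is simply slow and systemd kills it, raising the start timeout may help
# (sketch only; add "[Service]" and "TimeoutStartSec=900" in the override):
sudo systemctl edit mariadb.service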

Percona mysql xtradb cluster doesn't start properly and node restarts don't work

tl;dr
When starting a fresh Percona cluster of 3 Kubernetes pods, the grastate.dat seqno is set to -1 and doesn't change. On deleting one pod and watching it restart, expecting it to rejoin the cluster, it sets its initial position to 00000000-0000-0000-0000-000000000000:-1 and tries to connect to itself (its former IP), maybe because it had been the first pod in the cluster? It then times out in its erroneous connection to itself:
2017-03-26T08:38:05.374058Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
The cluster doesn't get started properly and I'm unable to successfully restart pods in the cluster.
Full
When I start the cluster from scratch, with blank data directories and a fresh etcd cluster, everything seems to come up. However, when I look at grastate.dat I find that the seqno for each pod is -1:
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
At this point I can do mysql -h percona -u wordpress -p and connect and wordpress works too.
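Note that seqno in grastate.dat is only expected to hold a real value after a clean shutdown; while mysqld is running (or after a crash) it stays at -1, so the -1 values above are not necessarily a fault by themselves. The live position is easier to read from the wsrep status variables. A small example, reusing the connection from the line above:

mysql -h percona -u wordpress -p -e "
  SHOW STATUS LIKE 'wsrep_cluster_size';
  SHOW STATUS LIKE 'wsrep_cluster_status';
  SHOW STATUS LIKE 'wsrep_last_committed';
  SHOW STATUS LIKE 'wsrep_local_state_comment';"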
Scenario:
I have 3 percona pods
jonathan@ubuntu:~/Projects/k8wp$ kubectl get pods
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 1 12h
etcd-1 1/1 Running 0 12h
etcd-2 1/1 Running 3 12h
etcd-3 1/1 Running 1 12h
percona-0 1/1 Running 0 8m
percona-1 1/1 Running 0 57m
percona-2 1/1 Running 0 57m
When I try to restart percona-0, it gets kicked out of the cluster on restart. percona-0's gvwstate.dat file shows:
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/gvwstate.dat
my_uuid: b7571ff8-11f8-11e7-bd2d-8b50487e1523
#vwbeg
view_id: 3 b7571ff8-11f8-11e7-bd2d-8b50487e1523 3
bootstrap: 0
member: b7571ff8-11f8-11e7-bd2d-8b50487e1523 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
The other 2 pods in the cluster show:
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/gvwstate.dat
my_uuid: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/gvwstate.dat
my_uuid: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
Here are what I think are the relevant errors from percona-0's startup:
2017-03-26T08:37:58.370605Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:'
2017-03-26T08:38:01.373345Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:38:01.373682Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-26T08:38:01.373750Z 0 [Note] WSREP: view(view_id(NON_PRIM,b7571ff8,5) memb {
b7571ff8,0
} joined {
} left {
} partitioned {
})
2017-03-26T08:38:01.373838Z 0 [Note] WSREP: gcomm: connected
2017-03-26T08:38:01.373872Z 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-03-26T08:38:01.373987Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-03-26T08:38:01.374012Z 0 [Note] WSREP: Opened channel 'wordpress-001'
2017-03-26T08:38:01.374108Z 0 [Note] WSREP: Waiting for SST to complete.
2017-03-26T08:38:01.374417Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-03-26T08:38:01.374469Z 0 [Note] WSREP: Flow-control interval: [16, 16]
2017-03-26T08:38:01.374491Z 0 [Note] WSREP: Received NON-PRIMARY.
2017-03-26T08:38:01.374560Z 1 [Note] WSREP: New cluster view: global state: :-1, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version -1
The IP it is trying to connect to, 10.52.0.26 in 2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:', is actually that pod's previous IP. Here's the listing of keys in etcd I did before deleting percona-0:
/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/wordpress
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress-001/10.52.0.26
/pxc-cluster/wordpress-001/10.52.0.26/hostname
/pxc-cluster/wordpress-001/10.52.0.26/ipaddr
After kubectl delete pods/percona-0:
/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress
Also, during the restart percona-0 tried to register with etcd using:
{"action":"create","node":{"key":"/pxc-cluster/queue/wordpress-001/00000000000000009886","value":"10.52.0.27","expiration":"2017-03-26T08:38:57.980325718Z","ttl":60,"modifiedIndex":9886,"createdIndex":9886}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/ipaddr","value":"10.52.0.27","expiration":"2017-03-26T08:38:28.01814818Z","ttl":30,"modifiedIndex":9887,"createdIndex":9887}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/hostname","value":"percona-0","expiration":"2017-03-26T08:38:28.037188157Z","ttl":30,"modifiedIndex":9888,"createdIndex":9888}}
{"action":"update","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"expiration":"2017-03-26T08:38:28.054726795Z","ttl":30,"modifiedIndex":9889,"createdIndex":9887},"prevNode":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"modifiedIndex":9887,"createdIndex":9887}}
which doesn't work.
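In the listings above, the directory for the old IP 10.52.0.26 does drop out of etcd eventually (the keys carry 30-60 s TTLs), but if the replacement pod starts before those TTLs expire, the discovery script can hand the pod back its own former address, which matches the "connecting to peer 10.52.0.26" lines in the log. One hedged workaround, using the same etcdctl v2 commands shown above, is to remove the stale entry by hand before the pod comes back:

# Remove the stale discovery entry for the pod's old IP (etcd v2 API, as in the listings above)
etcdctl rm --recursive /pxc-cluster/wordpress-001/10.52.0.26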
From the second member of the cluster percona-1:
2017-03-26T08:37:44.069583Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.52.0.26:4567
2017-03-26T08:37:45.069756Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') reconnecting to b7571ff8 (tcp://10.52.0.26:4567), attempt 0
2017-03-26T08:37:48.570332Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:37:49.605089Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspecting node: b7571ff8
2017-03-26T08:37:49.605276Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspected node without join message, declaring inactive
2017-03-26T08:37:50.104676Z 0 [Note] WSREP: declaring c33d6a73 at tcp://10.52.2.33:4567 stable
New Info:
I restarted percona-0 again, and this time it somehow came up! After a few tries I realised the pod needs to be restarted twice to come up, i.e. after deleting it the first time it comes up with the above errors; after deleting it the second time it comes up okay and syncs with the other members. Could this be because it was the first pod in the cluster?
I've tested deleting the other pods but they all come back up okay.
The issue only lies with percona-0.
Also:
If I take down all the pods at once (which is what would happen if my node crashed), the pods don't come back up at all! I suspect it's because no state is saved to grastate.dat, i.e. seqno remains -1 even though the global ID may change. The pods exit after mysqld shuts down, with the following errors:
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-2 | grep ERROR
2017-03-26T11:20:25.795085Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:25.795276Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:25.795544Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:25.795618Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:25.795645Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:25.795693Z 0 [ERROR] Aborting
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-1 | grep ERROR
2017-03-26T11:20:27.093780Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:27.093977Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:27.094145Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.1.49': -110 (Connection timed out)
2017-03-26T11:20:27.094200Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:27.094227Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.1.49) failed: 7
2017-03-26T11:20:27.094247Z 0 [ERROR] Aborting
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-0 | grep ERROR
2017-03-26T11:20:52.040214Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:52.040279Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:52.040385Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:52.040437Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:52.040471Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:52.040508Z 0 [ERROR] Aborting
grastate.dat after deleting all the pods:
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid: a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno: -1
safe_to_bootstrap: 0
There is no gvwstate.dat file.
I fixed it by changing the entrypoint in the container to the following script:
#!/bin/bash
sed -i "s|safe_to_bootstrap.*:.*|safe_to_bootstrap: 1|1" /var/lib/mysql/grastate.dat
/entrypoint.sh --wsrep-new-cluster
Thanks to https://www.claudiokuenzler.com/blog/494/galera-cluster-mysql-not-starting-failed-to-open-channel-reach-primary#.WNesDiF97Qo
The issue is that when restarting the 3 pods after a crash, they all hit the following error:
[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
What that means (summarizing from the link) is that since all the pods are down, the first pod (the pods are managed by a StatefulSet) comes up and tries to reconnect to the cluster but doesn't find any other pod it can connect to, so it goes down; the next pod comes up, tries the same thing, hits the same error and goes down too, and so on.
The solution is for the first pod to start a new cluster when it comes up; then all the subsequent pods will come up and find a node to connect to. It still comes up with all the data.
So with Percona XtraDB the Docker container's entrypoint looks like:
exec mysqld --user=mysql --wsrep_cluster_name=$CLUSTER_NAME --wsrep_cluster_address="gcomm://$cluster_join" --wsrep_sst_method=xtrabackup-v2 --wsrep_sst_auth="xtrabackup:$XTRABACKUP_PASSWORD" --log-error=${DATADIR}error.log $CMDARG
So all I have to do to get the setup running is pass the earlier argument --wsrep-new-cluster to the /entrypoint.sh file like so:
/entrypoint.sh --wsrep-new-cluster
P.S.
I tried just the above at first, but I ran into an error stating that to force a new cluster and bootstrap from this node, I had to set safe_to_bootstrap from 0 to 1 in /var/lib/mysql/grastate.dat.
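One caveat with that entrypoint is that it forces a bootstrap (--wsrep-new-cluster plus safe_to_bootstrap) on every pod, every time it starts, which is only appropriate when the whole cluster is down. Below is a hedged variant, assuming the StatefulSet ordinal naming seen above (percona-0, percona-1, ...), that only bootstraps from the first pod and lets the rest join normally. It is a sketch of the mechanism rather than a production entrypoint, since percona-0 would still re-bootstrap on every restart:

#!/bin/bash
# Sketch only: bootstrap from the -0 ordinal, join from everything else.
set -e
if [[ "$(hostname)" == *-0 ]]; then
    # grastate.dat may not exist yet on a brand-new volume
    if [ -f /var/lib/mysql/grastate.dat ]; then
        sed -i 's/^safe_to_bootstrap.*/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
    fi
    exec /entrypoint.sh --wsrep-new-cluster
else
    exec /entrypoint.sh
fi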

Percona Xtradb - unable to start nodes after long downtime

We have a Percona Xtradb-v2 cluster set up with 3 nodes.
Everything was working and in synchronisation when we shut down nodes 2 and 3, leaving only node 1. The nodes stayed down for a week, during which time the database grew by 100GB in size.
When we attempted to restart nodes 2 and 3, the startup failed during the initial SST after less than a minute. I have tried completely removing /var/lib/mysql and restarting, but it has the same effect.
The error logs appear to show an issue with the initial SST, possibly due to the volume of data required to be transferred for the initial startup. We have sufficient disk space, and file permissions are correct. The xtrabackup package is installed and available (and worked previously anyway).
The logs show a 'No such file or directory' error.
Joiner Logs show:
Dec 15 01:21:51 xm1adb05 mysqld: #011Group state: 67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440
Dec 15 01:21:51 xm1adb05 mysqld: #011Local state: 00000000-0000-0000-0000-000000000000:-1
Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: New cluster view: global state: 67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440, view# 54: Primary, number of nodes: 2, my index: 1, protocol version 3
Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Warning] WSREP: Gap in state sequence. Need state transfer.
Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.23.40.115' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '13029' '' '
Dec 15 01:21:51 xm1adb05 mysqld: WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20161215 01:21:51.575)
Dec 15 01:21:51 xm1adb05 -wsrep-sst-joiner: Streaming with xbstream
Dec 15 01:21:51 xm1adb05 -wsrep-sst-joiner: Using socat as streamer
...
Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb): 1 (Operation not permitted)
Dec 15 01:21:51 xm1adb05 mysqld: #011 at galera/src/replicator_str.cpp:prepare_for_IST():507. IST will be unavailable.
...
Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: Member 1.0 (xm1adb05) requested state transfer from '*any*'. Selected 0.0 (xm1adb04)(SYNCED) as donor.
Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 5766440)
Dec 15 01:21:51 xm1adb05 mysqld: 2016-12-15 01:21:51 13029 [Note] WSREP: Requesting state transfer: success, donor: 0
Dec 15 01:21:51 xm1adb05 mysql-systemd: State transfer in progress, setting sleep higher
...
Dec 15 01:22:02 xm1adb05 -wsrep-sst-joiner: xtrabackup_checkpoints missing, failed innobackupex/SST on donor
Dec 15 01:22:02 xm1adb05 -wsrep-sst-joiner: Cleanup after exit with status:2
Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.23.40.115' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '13029' '' : 2 (No such file or directory)
Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] WSREP: SST failed: 2 (No such file or directory)
Dec 15 01:22:02 xm1adb05 mysqld: 2016-12-15 01:22:02 13029 [ERROR] Aborting
Donor logs show:
Dec 15 01:22:02 xm1adb04 mysqld: 2016-12-15 01:22:02 6531 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.23.40.115:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid '67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440'
Dec 15 01:22:02 xm1adb04 mysqld: 2016-12-15 01:22:02 6531 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.23.40.115:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid '67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440': 22 (Invalid argument)
Dec 15 01:22:03 xm1adb04 mysqld: 2016-12-15 01:22:03 6531 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.23.40.115:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid '67e7e56d-8e95-11e6-a9d2-ce8abe8f95bb:5766440'
Similar actions successfully started the secondary nodes on another (much smaller) database, so it would seem that the size may be the issue.
Can anyone give some help on how we can initialise and re-start the additional nodes?
Just switch to rsync.
wsrep_sst_method = rsync
Once the node is in sync, stop it, switch to xtrabackup and start again.
XtraBackup failed in your case. You can get a lead on the XtraBackup failure by looking at the log files XtraBackup generates.
By the way, you should check the XtraBackup log on the donor node.
As per the short snippet, XtraBackup reports (Invalid argument).
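For reference, the xtrabackup SST wrapper usually leaves its own logs in the datadir, which is where the "(Invalid argument)" tends to be explained in more detail. The file names below are the usual defaults for wsrep_sst_xtrabackup-v2, not confirmed from this particular setup:

# On the donor (xm1adb04):
sudo less /var/lib/mysql/innobackup.backup.log
# On the joiner (xm1adb05):
sudo less /var/lib/mysql/innobackup.prepare.log
sudo less /var/lib/mysql/innobackup.move.log
# The SST stream goes to the joiner on port 4444, so also confirm the donor can reach it:
nc -z -w 3 10.23.40.115 4444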
The solution to restart the nodes was to first restart the one remaining cluster member (1), then completely wipe /var/lib/mysql on the joiners (2 and 3) before trying to restart and rejoin. This forces an SST, and everything worked.
The problem seems to be that nodes 2 and 3 were marked as partitioned on node 1, so it was not allowing the SST to complete (I think maybe the final IST was denied and so the SST rolled back). Restarting node 1 seems to reset the partitioning, and then the SST could complete.
We also had a rather small gcache.size which didn't help as there were a lot of writes going on in the database.
Later events showed that the SST seems to have failed due to issues with xtrabackup on the donor node. The xtrabackup process didn't like the my.cnf settings where we had a line duplicated. Fixing this and restarting the donor (to end the partitioning) has let things work.
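On the gcache point: a larger ring buffer lets a node that was down for longer catch up with an IST instead of a full SST. It is set through the provider options, for example as below (the 2G figure is only an illustration, not a recommendation for this workload):

[mysqld]
wsrep_provider_options="gcache.size=2G"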
For anyone looking for an answer with similar errors, especially on the donor: in my case I had to open firewall ports 4444, 4567 and 4568 across all the nodes.
Please first take a backup of the most advanced node, node 1, and restore it on the second and third nodes, then try to restart them; that should solve your problem.
While starting the second node, the donor cannot get the last LSN from the joiner, because the second node's datadir is empty and there is no data.

MariaDB Galera Cluster set up problems

I am trying to get a MariaDB cluster up and running, but it is not working out for me. Right now I am using MariaDB Galera 5.5.36 on a 64-bit Red Hat ES 6 machine. I installed MariaDB through this repo:
[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/5.5-galera/rhel6-amd64/
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1
In server.conf I have the following on server 1:
[mariadb]
log_error=/var/log/mariadb.log
query_cache_size=0
query_cache_type=0
binlog_format=ROW
default_storage_engine=innodb
innodb_autoinc_lock_mode=2
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.211.133
wsrep_cluster_name='cluster'
wsrep_node_address='192.168.211.132'
wsrep_node_name='cluster1'
wsrep_sst_method=rsync
and on server 2 I have
[mariadb]
log_error=/var/log/mariadb.log
query_cache_size=0
query_cache_type=0
binlog_format=ROW
default_storage_engine=innodb
innodb_autoinc_lock_mode=2
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.211.132
wsrep_cluster_name='cluster'
wsrep_node_address='192.168.211.133'
wsrep_node_name='cluster2'
wsrep_sst_method=rsync
When I start server 1 with the command sudo service mysql start --wsrep-new-cluster, it starts up just fine; if I open mysql and check the wsrep status, it says everything is up and running, which is good. But when I run sudo service mysql start on the second server, I get the following in the error logs:
140609 14:47:55 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140609 14:47:56 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.i5qfm2' --pid-file='/var/lib/mysql/localhost.localdomain-recover.pid'
140609 14:47:57 mysqld_safe WSREP: Recovered position 85448d73-ebe8-11e3-9c20-fbc1995fee11:0
140609 14:47:57 [Note] WSREP: wsrep_start_position var submitted: '85448d73-ebe8-11e3-9c20-fbc1995fee11:0'
140609 14:47:57 [Note] WSREP: Read nil XID from storage engines, skipping position init
140609 14:47:57 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
140609 14:47:57 [Note] WSREP: wsrep_load(): Galera 25.3.2(r170) by Codership Oy <info@codership.com> loaded successfully.
140609 14:47:57 [Note] WSREP: CRC-32C: using hardware acceleration.
140609 14:47:57 [Note] WSREP: Found saved state: 85448d73-ebe8-11e3-9c20-fbc1995fee11:-1
140609 14:47:57 [Note] WSREP: Passing config to GCS: base_host = 192.168.211.133; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5
140609 14:47:57 [Note] WSREP: Assign initial position for certification: 0, protocol version: -1
140609 14:47:57 [Note] WSREP: wsrep_sst_grab()
140609 14:47:57 [Note] WSREP: Start replication
140609 14:47:57 [Note] WSREP: Setting initial position to 85448d73-ebe8-11e3-9c20-fbc1995fee11:0
140609 14:47:57 [Note] WSREP: protonet asio version 0
140609 14:47:57 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
140609 14:47:57 [Note] WSREP: backend: asio
140609 14:47:57 [Note] WSREP: GMCast version 0
140609 14:47:57 [Note] WSREP: (0c085f34-efe5-11e3-9f6b-8bfd1706e2a4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
140609 14:47:57 [Note] WSREP: (0c085f34-efe5-11e3-9f6b-8bfd1706e2a4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
140609 14:47:57 [Note] WSREP: EVS version 0
140609 14:47:57 [Note] WSREP: PC version 0
140609 14:47:57 [Note] WSREP: gcomm: connecting to group 'cluster', peer '192.168.211.132:,192.168.211.134:'
140609 14:48:00 [Warning] WSREP: no nodes coming from prim view, prim not possible
140609 14:48:00 [Note] WSREP: view(view_id(NON_PRIM,0c085f34-efe5-11e3-9f6b-8bfd1706e2a4,1) memb {
0c085f34-efe5-11e3-9f6b-8bfd1706e2a4,0
} joined {
} left {
} partitioned {
})
140609 14:48:01 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50775S), skipping check
140609 14:48:31 [Note] WSREP: view((empty))
140609 14:48:31 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
140609 14:48:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
140609 14:48:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'cluster' at 'gcomm://192.168.211.132,192.168.211.134': -110 (Connection timed out)
140609 14:48:31 [ERROR] WSREP: gcs connect failed: Connection timed out
140609 14:48:31 [ERROR] WSREP: wsrep::connect() failed: 7
140609 14:48:31 [ERROR] Aborting
140609 14:48:31 [Note] WSREP: Service disconnected.
140609 14:48:32 [Note] WSREP: Some threads may fail to exit.
140609 14:48:32 [Note] /usr/sbin/mysqld: Shutdown complete
140609 14:48:32 mysqld_safe mysqld from pid file /var/lib/mysql/localhost.localdomain.pid ended
I am at a loss as to why the second server cannot detect that a cluster is up and running. These machines can communicate with each other just fine: I can SSH from one to the other and they can ping each other. I tried deleting the galera cache, tried downgrading my version of MariaDB Galera, tried disabling SELinux, tried running the mysql service as a different user, verified that the correct ports are open, tried running them on 2 VMs on separate computers with different IP addresses, etc. Does anyone have any idea what is going on here? I have been searching for 3 days trying to fix this, but no solution seems to work for me.
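One more thing worth capturing before comparing configs: whether server 1 really is Primary and listening, and whether server 2 can actually reach it on the group-communication port, since the log above only proves that server 2 gave up after about 30 seconds. A small sketch using the addresses from the question:

# On server 1 (the bootstrapped node):
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'; SHOW STATUS LIKE 'wsrep_cluster_size';"
ss -ltn | grep 4567
# From server 2, check the group-communication port on server 1:
nc -z -w 3 192.168.211.132 4567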
Here is how I fixed my similar issue.
CentOS 7 w/ MariaDB Galera 10.1.
On node2 I saw this:
2016-12-27 15:40:38 140703512762624 [Warning] WSREP: no nodes coming from prim view, prim not possible
After doing some reading, I tried running this on node1.
service mysql start --wsrep-new-cluster
But this failed, and in the logs, I found this...
2016-12-27 15:44:08 140438853814528 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .
So I edited the file /var/lib/mysql/grastate.dat, changing safe_to_bootstrap to 1.
I was then able to start the Primary node using:
service mysql start --wsrep-new-cluster
Then on the others, I just used
service mysql start
Note: This was in a demo pre-production environment. I promptly broke it after getting everything to work by rebooting all the servers at the same time :P, but I knew there were no writes and that the DBs were in sync. If you are in production and this happens, you can use the following to figure out which node to run "new-cluster" on, which is akin to saying: make me primary.
mysqld_safe --wsrep-recover
If this is a production issue, I highly recommend reading this article and making a backup with CloneZilla before throwing commands at the broken clients!
https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/
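For completeness, --wsrep-recover prints the last committed position for each stopped node; whichever node reports the highest seqno is the one to bootstrap. A sketch of what to look for (the UUID below is just the one from the logs earlier in this question):

mysqld_safe --wsrep-recover
# Then check the error log for a line like:
#   WSREP: Recovered position: 85448d73-ebe8-11e3-9c20-fbc1995fee11:1234
# Bootstrap the node with the highest number after the colon, then start the rest normally.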
The cluster must be started with this command on the primary node:
galera_new_cluster
After starting the first node, you can start the other nodes in the cluster successfully.
I believe you need to list all the IPs in the wsrep_cluster_address parameter.
wsrep_cluster_address=gcomm://192.168.211.132,192.168.211.133
This should be done on both hosts. By the way, you likely want three nodes rather than two, to avoid split-brain scenarios.
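Concretely, that would make the Galera sections look like this (only the node-specific lines differ between the two hosts; values are taken from the configs in the question):

# server 1
wsrep_cluster_address=gcomm://192.168.211.132,192.168.211.133
wsrep_node_address='192.168.211.132'
wsrep_node_name='cluster1'
# server 2
wsrep_cluster_address=gcomm://192.168.211.132,192.168.211.133
wsrep_node_address='192.168.211.133'
wsrep_node_name='cluster2'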