We have a Percona XtraDB Cluster with about 11 nodes. One of the nodes crashed about two days ago and is now failing to start again, even though the donor indicates that the SST process is complete and the node has JOINED the cluster.
When I check the log of the crashed node, I keep seeing this error repeatedly (at intervals of hours):
[Warning] WSREP: Failed to report last committed [xxxxxx] -4 (Interrupted system call)
Apart from that message, which pops up in the error log only once every few hours, the only line being logged is:
....
2015-10-19 11:23:48 9091 [Note] WSREP: (f771e66c, 'tcp://0.0.0.0:4567') address 'tcp://192.168.2.100:4567' pointing to uuid f771e66c is blacklisted, skipping
2015-10-19 11:23:48 9091 [Note] WSREP: (f771e66c, 'tcp://0.0.0.0:4567') address 'tcp://192.168.2.100:4567' pointing to uuid f771e66c is blacklisted, skipping
2015-10-19 11:23:48 9091 [Note] WSREP: (f771e66c, 'tcp://0.0.0.0:4567') address 'tcp://192.168.2.100:4567' pointing to uuid f771e66c is blacklisted, skipping
[Warning] WSREP: Failed to report last committed [xxxxxx] -4 (Interrupted system call)
2015-10-19 11:23:48 9091 [Note] WSREP: (f771e66c, 'tcp://0.0.0.0:4567') address 'tcp://192.168.2.100:4567' pointing to uuid f771e66c is blacklisted, skipping
2015-10-19 11:23:48 9091 [Note] WSREP: (f771e66c, 'tcp://0.0.0.0:4567') address 'tcp://192.168.2.100:4567' pointing to uuid f771e66c is blacklisted, skipping
2015-10-19 11:23:48 9091 [Note] WSREP: (f771e66c, 'tcp://0.0.0.0:4567') address 'tcp://192.168.2.100:4567' pointing to uuid f771e66c is blacklisted, skipping
....
What might be causing this? Why won't the node start again? And how can I fix the node, start it, and have it rejoin the cluster?
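For context, a common recovery path for a joiner that stays wedged like this is to force a fresh full SST by clearing the node's saved Galera state. This is only a hedged sketch (default datadir assumed, and it does not by itself explain the blacklist messages):
sudo service mysql stop
sudo mv /var/lib/mysql/grastate.dat /var/lib/mysql/grastate.dat.bak   # losing the saved state forces a full SST on the next start
sudo service mysql start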
Debian 10, MariaDB 10.3.26, Galera-3 25.3.31
I have a three-node cluster. The nodes are named node3, node4, and node5. Node3 occasionally gets disconnected from the cluster.
If it picks node5 to recover from, I get:
2020-11-18 19:42:08 7 [Note] WSREP: Requesting state transfer: success, donor: 2
2020-11-18 19:42:08 7 [Note] WSREP: GCache history reset: 57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:0 -> 57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:75720363
2020-11-18 19:42:08 17 [Note] WSREP: SST received: 57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:75696989
2020-11-18 19:42:08 17 [Note] WSREP: wsrep_start_position set to '57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:75696989'
2020-11-18 19:42:08 7 [Note] WSREP: Receiving IST: 23374 writesets, seqnos 75696989-75720363
2020-11-18 19:42:08 0 [Note] WSREP: 2.0 (node5): State transfer to 0.0 (node3) complete.
2020-11-18 19:42:08 0 [Note] WSREP: Member 2.0 (node5) synced with group.
2020-11-18 19:42:08 0 [Note] WSREP: (23249d11, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-11-18 19:42:15 0 [Warning] WSREP: Protocol violation. JOIN message sender 2.0 (node5) is not in state transfer (SYNCED). Message ignored.
after which node3 sits forever, never changing wsrep_ready to yes.
On the other hand, if node3 picks node4, I get all the same sorts of messages, except that
[Warning] WSREP: Protocol violation. JOIN message sender 2.0 (node5) is not in state transfer (SYNCED). Message ignored.
does not appear; eventually wsrep_ready becomes yes on node3 and the node starts processing queries.
Any idea how I might figure out the issue?
Here is some more data. This is an example of a successful join, when node3 chooses node4 instead of node5:
2020-11-19 21:12:54 7 [Note] WSREP: State transfer required:
Group state: 57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:75815331
Local state: 57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:75696989
2020-11-19 21:12:54 7 [Note] WSREP: REPL Protocols: 9 (4, 2)
2020-11-19 21:12:54 7 [Note] WSREP: New cluster view: global state: 57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:75815331, view# 349: Primary, number of nodes: 3, my index: 2, protocol version 3
2020-11-19 21:12:54 7 [Warning] WSREP: Gap in state sequence. Need state transfer.
2020-11-19 21:12:56 7 [Note] WSREP: Prepared SST request: mysqldump|10.4.44.82:3360
2020-11-19 21:12:56 7 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-11-19 21:12:56 7 [Note] WSREP: Assign initial position for certification: 75815331, protocol version: 4
2020-11-19 21:12:56 0 [Note] WSREP: Service thread queue flushed.
2020-11-19 21:12:56 7 [Note] WSREP: IST receiver addr using tcp://x.y.z.a:4568
2020-11-19 21:12:56 7 [Note] WSREP: Prepared IST receiver, listening at: tcp://x.y.z.a:4568
2020-11-19 21:12:56 0 [Note] WSREP: Member 2.0 (node3) requested state transfer from '*any*'. Selected 0.0 (node4)(SYNCED) as donor.
2020-11-19 21:12:56 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 75815331)
2020-11-19 21:12:56 7 [Note] WSREP: Requesting state transfer: success, donor: 0
2020-11-19 21:12:56 7 [Note] WSREP: GCache history reset: 57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:0 -> 57b37aa2-d111-11e8-a015-ab6cf5f3b3ea:75815331
2020-11-19 21:12:56 0 [Note] WSREP: (fcbfdc45, 'tcp://0.0.0.0:4567') turning message relay requesting off
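One way to narrow this down (a sketch, on the assumption that the donor choice itself is the variable) is to pin node3's donor explicitly with wsrep_sst_donor, which takes a comma-separated list of preferred donor node names, and check whether the join then always behaves like the node4 case. The config file path below is an assumption for a Debian/MariaDB layout:
sudo tee /etc/mysql/mariadb.conf.d/zz-donor-test.cnf <<'EOF'
[mysqld]
# trailing comma = fall back to any other donor if node4 is unavailable
wsrep_sst_donor = node4,
EOF
sudo systemctl restart mariadb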
I have a three-node setup that has been running perfectly for the past few months.
Recently I wanted to add another node in a different location, but I keep getting errors.
I originally followed this tutorial (which is how I set up the cluster a few months ago): https://www.howtoforge.com/tutorial/how-to-install-and-configure-galera-cluster-on-ubuntu-1604/ . I did not rebuild all the nodes from scratch; I just edited /etc/mysql/conf.d/galera.cnf on the existing three nodes and added the new node's IP to their cluster address. For the fourth node, /etc/mysql/conf.d/galera.cnf is set up like this:
[mysqld]
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
# Galera Provider Configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
# Galera Cluster Configuration
wsrep_cluster_name="galera_cluster"
wsrep_cluster_address="gcomm://node1_ip,node2_ip,node3_ip,node4_ip"
# Galera Synchronization Configuration
wsrep_sst_method=rsync
# Galera Node Configuration
wsrep_node_address="xx.xx.xxx.xxx"
wsrep_node_name="Node4"
Somehow I am getting this huge error:
Group state: e3ade7e7-e682-11e7-8d16-be7d28cda90e:36273
Local state: 00000000-0000-0000-0000-000000000000:-1
[Note] WSREP: New cluster view: global state: e3ade7e7-e682-11e7-8d16-be7d28cda90e:36273, view# 122: Primary, number of nodes: 4, my
[Warning] WSREP: Gap in state sequence. Need state transfer.
[Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address 'xxx.node.4.ip' --datadir '/var/lib/mysql/' --parent '22828' ''
rsyncd version 3.1.1 starting, listening on port 4444
[Note] WSREP: Prepared SST request: rsync|xxx.node.4.ip:4444/rsync_sst
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] WSREP: REPL Protocols: 7 (3, 2)
[Note] WSREP: Assign initial position for certification: 36273, protocol version: 3
[Note] WSREP: Service thread queue flushed.
[Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
[Note] WSREP: Member 0.0 (Node4) requested state transfer from '*any*'. Selected 1.0 (Node1)(SYNCED) as donor.
[Note] WSREP: Shifting PRIMARY -> JOINER (TO: 36273)
[Note] WSREP: Requesting state transfer: success, donor: 1
[Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> e3ade7e7-e682-11e7-8d16-be7d28cda90e:36273
[Note] WSREP: (7642cf37, 'tcp://0.0.0.0:4567') connection to peer 7642cf37 with addr tcp://xxx.node.4.ip:4567 timed out, no messages
[Note] WSREP: (7642cf37, 'tcp://0.0.0.0:4567') turning message relay requesting off
mariadb.service: Start operation timed out. Terminating.
Terminated
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 22875
sent 0 bytes received 0 bytes total size 0
WSREP_SST: [INFO] Joiner cleanup done.
[ERROR] WSREP: Process was aborted.
[ERROR] WSREP: Process completed with error: wsrep_sst_rsync --role 'joiner' --address 'xxx.node.4.ip' --datadir '/var/lib/mysql/'
[ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
[ERROR] WSREP: SST failed: 2 (No such file or directory)
[ERROR] Aborting
Error in my_thread_global_end(): 1 threads didn't exit
mariadb.service: Main process exited, code=exited, status=1/FAILURE
Failed to start MariaDB 10.1.33 database server.
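For what it's worth, the 'mariadb.service: Start operation timed out. Terminating.' line above suggests systemd killed mysqld before the rsync SST could finish. A hedged mitigation sketch (assuming a systemd-managed MariaDB service) is to raise or disable the start timeout with a drop-in override:
sudo mkdir -p /etc/systemd/system/mariadb.service.d
sudo tee /etc/systemd/system/mariadb.service.d/timeout.conf <<'EOF'
[Service]
# give a long-running SST time to complete; 0 disables the start timeout
TimeoutStartSec=0
EOF
sudo systemctl daemon-reload
sudo systemctl restart mariadb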
P.S. The older three nodes run MariaDB 10.1.29, and the new node runs 10.1.33.
Thanks in advance for any suggestions.
I have a Galera cluster (10.0.27) with 3 nodes, each on a dedicated server.
After a reboot of one of the servers, that node can no longer join the cluster or perform a full SST.
It is as if mysql skips launching some commands.
I have a second 'development' cluster with the same configuration that works perfectly; I have no problem adding a node there. I noticed the following difference between the working and the non-working cluster when I add a node back for a full SST:
Node joining on the working cluster:
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Quorum results:
11:44:52 mysqld: #011version = 4,
11:44:52 mysqld: #011component = PRIMARY,
11:44:52 mysqld: #011conf_id = 8,
11:44:52 mysqld: #011members = 2/3 (joined/total),
11:44:52 mysqld: #011act_id = 906976,
11:44:52 mysqld: #011last_appl. = -1,
11:44:52 mysqld: #011protocols = 0/7/3 (gcs/repl/appl),
11:44:52 mysqld: #011group UUID = 27ba4c4f-9b78-11e6-824c-f3b1e60fa202
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Flow-control interval: [28, 28]
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 906976)
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: State transfer required:
11:44:52 mysqld: #011Group state: 27ba4c4f-9b78-11e6-824c-f3b1e60fa202:906976
11:44:52 mysqld: #011Local state: 00000000-0000-0000-0000-000000000000:-1
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: New cluster view: global state: 27ba4c4f-9b78-11e6-824c-f3b1e60fa202:906976, view# 9: Primary, number of nodes: 3, my index: 2, protocol version 3
11:44:52 mysqld: 170628 11:44:52 [Warning] WSREP: Gap in state sequence. Need state transfer.
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.***.***.**2' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '16472' --binlog '/var/log/mysql/mariadb-bin' '
**11:44:52 rsyncd[16514]: rsyncd version 3.1.1 starting, listening on port 4444**
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Prepared SST request: rsync|192.***.***.**2:4444/rsync_sst
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: REPL Protocols: 7 (3, 2)
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Assign initial position for certification: 906976, protocol version: 3
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Service thread queue flushed.
11:44:52 mysqld: 170628 11:44:52 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (27ba4c4f-9b78-11e6-824c-f3b1e60fa202): 1 (Operation not permitted)
11:44:52 mysqld: #011 at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Member 2.0 (server-3) requested state transfer from '*any*'. Selected 0.0 (server1)(SYNCED) as donor.
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 906977)
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: Requesting state transfer: success, donor: 0
11:44:52 mysqld: 170628 11:44:52 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(27ba4c4f-9b78-11e6-824c-f3b1e60fa202:906976)
11:44:52 rsyncd[16531]: name lookup failed for 192.***.***.**1: Name or service not known
11:44:52 rsyncd[16531]: connect from UNKNOWN (192.***.***.**1)
11:44:52 rsyncd[16531]: rsync to rsync_sst/ from UNKNOWN (192.***.***.**1)
11:44:52 rsyncd[16531]: receiving file list
11:44:54 rsyncd[16553]: name lookup failed for 192.***.***.**1: Name or service not known
11:44:54 rsyncd[16553]: connect from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16531]: sent 114 bytes received 146847600 bytes total size 146810880
11:44:54 rsyncd[16553]: rsync to rsync_sst-log_dir/ from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16553]: receiving file list
11:44:54 rsyncd[16553]: sent 63 bytes received 100688095 bytes total size 100663296
11:44:54 rsyncd[16559]: name lookup failed for 192.***.***.**1: Name or service not known
11:44:54 rsyncd[16559]: connect from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16560]: name lookup failed for 192.***.***.**1: Name or service not known
11:44:54 rsyncd[16560]: connect from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16561]: name lookup failed for 192.***.***.**1: Name or service not known
11:44:54 rsyncd[16561]: connect from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16562]: name lookup failed for 192.***.***.**1: Name or service not known
11:44:54 rsyncd[16562]: connect from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16559]: rsync to rsync_sst/./db_1 from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16562]: rsync to rsync_sst/./db_2 from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16560]: rsync to rsync_sst/./db_3 from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16561]: rsync to rsync_sst/./db_3 from UNKNOWN (192.***.***.**1)
11:44:54 rsyncd[16560]: receiving file list
...
Node joining on the non-working cluster:
13:36:28 mysqld: 170630 13:36:28 [Note] WSREP: Quorum results:
13:36:28 mysqld: #011version = 4,
13:36:28 mysqld: #011component = PRIMARY,
13:36:28 mysqld: #011conf_id = 514,
13:36:28 mysqld: #011members = 2/3 (joined/total),
13:36:28 mysqld: #011act_id = 242914778,
13:36:28 mysqld: #011last_appl. = -1,
13:36:28 mysqld: #011protocols = 0/7/3 (gcs/repl/appl),
13:36:28 mysqld: #011group UUID = 8119e584-9f83-11e6-b292-7a8102156c2d
13:36:28 mysqld: 170630 13:36:28 [Note] WSREP: Flow-control interval: [28, 28]
13:36:28 mysqld: 170630 13:36:28 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 242914778)
13:36:28 mysqld: 170630 13:36:28 [Note] WSREP: State transfer required:
13:36:28 mysqld: #011Group state: 8119e584-9f83-11e6-b292-7a8102156c2d:242914778
13:36:28 mysqld: #011Local state: 00000000-0000-0000-0000-000000000000:-1
13:36:28 mysqld: 170630 13:36:28 [Note] WSREP: New cluster view: global state: 8119e584-9f83-11e6-b292-7a8102156c2d:242914778, view# 515: Primary, number of nodes: 3, my index: 2, protocol version 3
13:36:28 mysqld: 170630 13:36:28 [Warning] WSREP: Gap in state sequence. Need state transfer.
13:36:28 mysqld: 170630 13:36:28 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.***.***.*11' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '13253' --binlog '/var/log/mysql/mariadb-bin' '
13:36:28 rsyncd[13316]: rsyncd version 3.1.1 starting, listening on port 4444
13:36:32 mysqld: 170630 13:36:32 [Note] WSREP: (85c5aae8, 'tcp://0.0.0.0:4567') turning message relay requesting off
13:36:56 /etc/init.d/mysql[14935]: 0 processes alive and '/usr/bin/mysqladmin --defaults-file=/etc/mysql/debian.cnf ping' resulted in
The difference comes just after this line:
rsyncd[13316]: rsyncd version 3.1.1 starting, listening on port 4444
On the working cluster, the next line is:
WSREP: Prepared SST request: rsync|192.***.***.**2:4444/rsync_sst
On the non-working cluster this line never appears; it's as if the SST request is never made.
I can provide more information about the configuration if you think it would help to find the issue.
Thanks for your help!
I had the same issue; here's what I found:
The wsrep_sst_rsync script was stuck in an endless loop, in my case because the output of lsof -i :$rsync_port was empty. For some (unknown) reason, lsof had the setgid bit set:
[dbserver1:~]# ls -l /usr/bin/lsof
-rwxr-sr-x 1 root root 163224 Oct 28 2015 /usr/bin/lsof
This caused wsrep_sst_rsync to loop endlessly, because it uses lsof to check whether rsync has started. Removing the flag lets the script continue, which eventually starts the SST.
The flag can be removed using:
[dbserver1:~]# chmod g-s /usr/bin/lsof
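For illustration, the joiner side of wsrep_sst_rsync waits for its local rsync daemon roughly like this (a paraphrased sketch, not the exact script source); with a broken lsof the check never succeeds, hence the endless loop:
RSYNC_PORT=4444
until lsof -i :$RSYNC_PORT -Pn 2>/dev/null | grep -q '(LISTEN)'
do
    sleep 0.2    # spins forever if lsof never reports the listening rsync
done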
I'm sure there is a simple fix for this, but forgive me, I'm new to PXC. I'm using rsync to transfer the state of the bootstrapping node to node2, which is the node I am joining to the cluster. I originally tried XtraBackup but ran into problems that I will explore another time; for now I'm using rsync for what I thought would be its simplicity. If you scroll down to the [ERROR] lines you will see where the problems are interrupting the state transfer. What could be causing this?
*WSREP_SST: **[ERROR]** find/rsync returned code 123: (20141228 02:24:40.505)
2014-12-28 02:24:40 9446 **[ERROR]** WSREP: Failed to read from: **wsrep_sst_rsync** --role 'donor' --address '192.168.70.2:4444/rsync_sst' --auth 'sst:sst' --socket*
2014-12-28 02:24:24 9446 [Note] WSREP: (2861d1d7, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2014-12-28 02:24:24 9446 [Note] WSREP: declaring 41f79045 at tcp://192.168.70.2:4567 stable
2014-12-28 02:24:24 9446 [Note] WSREP: Node 2861d1d7 state prim
2014-12-28 02:24:24 9446 [Note] WSREP: save pc into disk
2014-12-28 02:24:24 9446 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2014-12-28 02:24:24 9446 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 42495aba-8e30-11e4-9596-c702faf22ad0
2014-12-28 02:24:24 9446 [Note] WSREP: STATE EXCHANGE: sent state msg: 42495aba-8e30-11e4-9596-c702faf22ad0
2014-12-28 02:24:24 9446 [Note] WSREP: STATE EXCHANGE: got state msg: 42495aba-8e30-11e4-9596-c702faf22ad0 from 0 (node3)
2014-12-28 02:24:24 9446 [Note] WSREP: STATE EXCHANGE: got state msg: 42495aba-8e30-11e4-9596-c702faf22ad0 from 1 (node2)
2014-12-28 02:24:24 9446 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 1,
members = 1/2 (joined/total),
act_id = 0,
last_appl. = 0,
protocols = 0/6/3 (gcs/repl/appl),
group UUID = 48ec9889-8ddc-11e4-9efd-da6610fd24da
2014-12-28 02:24:24 9446 [Note] WSREP: Flow-control interval: [23, 23]
2014-12-28 02:24:24 9446 [Note] WSREP: New cluster view: global state: 48ec9889-8ddc-11e4-9efd-da6610fd24da:0, view# 2: Primary, number of nodes: 2, my index: 0, protocol version 3
2014-12-28 02:24:24 9446 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-12-28 02:24:24 9446 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-12-28 02:24:24 9446 [Note] WSREP: Service thread queue flushed.
2014-12-28 02:24:24 9446 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
2014-12-28 02:24:24 9446 [Note] WSREP: Service thread queue flushed.
2014-12-28 02:24:25 9446 [Note] WSREP: Member 1.0 (node2) requested state transfer from '*any*'. Selected 0.0 (node3)(SYNCED) as donor.
2014-12-28 02:24:25 9446 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 0)
2014-12-28 02:24:25 9446 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-12-28 02:24:25 9446 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '192.168.70.2:4444/rsync_sst' --auth 'sst:sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --binlog 'mysql-bin' --gtid '48ec9889-8ddc-11e4-9efd-da6610fd24da:0''
2014-12-28 02:24:25 9446 [Note] WSREP: sst_donor_thread signaled with 0
2014-12-28 02:24:25 9446 [Note] WSREP: Flushing tables for SST...
2014-12-28 02:24:25 9446 [Note] WSREP: Provider paused at 48ec9889-8ddc-11e4-9efd-da6610fd24da:0 (5)
2014-12-28 02:24:25 9446 [Note] WSREP: Tables flushed.
WSREP_SST: [INFO] Preparing binlog files for transfer: (20141228 02:24:26.201)
mysql-bin.000015
2014-12-28 02:24:27 9446 [Note] WSREP: (2861d1d7, 'tcp://0.0.0.0:4567') turning message relay requesting off
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest1.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest2.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest3.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest4.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest5.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest6.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest7.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest8.ibd": Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
rsync: send_files failed to open "/var/lib/mysql/mysql/innodb_index_stats.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/mysql/innodb_table_stats.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/mysql/slave_master_info.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/mysql/slave_relay_log_info.ibd": Permission denied (13)
rsync: send_files failed to open "/var/lib/mysql/mysql/slave_worker_info.ibd": Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
WSREP_SST: **[ERROR]** find/rsync returned code 123: (20141228 02:24:40.505)
2014-12-28 02:24:40 9446 **[ERROR]** WSREP: Failed to read from: **wsrep_sst_rsync** --role 'donor' --address '192.168.70.2:4444/rsync_sst' --auth 'sst:sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --binlog 'mysql-bin' --gtid '48ec9889-8ddc-11e4-9efd-da6610fd24da:0'
2014-12-28 02:24:40 9446 **[ERROR]** WSREP: Process completed with error: wsrep_sst_rsync --role 'donor' --address '192.168.70.2:4444/rsync_sst' --auth 'sst:sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --binlog 'mysql-bin' --gtid '48ec9889-8ddc-11e4-9efd-da6610fd24da:0': 255 (Unknown error 255)
2014-12-28 02:24:40 9446 [Note] WSREP: resuming provider at 5
2014-12-28 02:24:40 9446 [Note] WSREP: Provider resumed.
2014-12-28 02:24:40 9446 **[ERROR]** WSREP: Command did not run: wsrep_sst_rsync --role 'donor' --address '192.168.70.2:4444/rsync_sst' --auth 'sst:sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --binlog 'mysql-bin' --gtid '48ec9889-8ddc-11e4-9efd-da6610fd24da:0'
2014-12-28 02:24:40 9446 [Warning] WSREP: 0.0 (node3): State transfer to 1.0 (node2) failed: -255 (Unknown error 255)
2014-12-28 02:24:40 9446 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 0)
2014-12-28 02:24:40 9446 [Note] WSREP: Member 0.0 (node3) synced with group.
2014-12-28 02:24:40 9446 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
2014-12-28 02:24:40 9446 [Note] WSREP: Synchronized with group, ready for connections
2014-12-28 02:24:40 9446 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-12-28 02:24:41 9446 [Note] WSREP: forgetting 41f79045 (tcp://192.168.70.2:4567)
2014-12-28 02:24:41 9446 [Note] WSREP: Node 2861d1d7 state prim
2014-12-28 02:24:41 9446 [Note] WSREP: save pc into disk
2014-12-28 02:24:41 9446 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2014-12-28 02:24:41 9446 [Note] WSREP: forgetting 41f79045 (tcp://192.168.70.2:4567)
2014-12-28 02:24:41 9446 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4c2e3544-8e30-11e4-a0dc-d280b597b8d2
2014-12-28 02:24:41 9446 [Note] WSREP: STATE EXCHANGE: sent state msg: 4c2e3544-8e30-11e4-a0dc-d280b597b8d2
2014-12-28 02:24:41 9446 [Note] WSREP: STATE EXCHANGE: got state msg: 4c2e3544-8e30-11e4-a0dc-d280b597b8d2 from 0 (node3)
2014-12-28 02:24:41 9446 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 2,
members = 1/1 (joined/total),
act_id = 0,
last_appl. = 0,
protocols = 0/6/3 (gcs/repl/appl),
group UUID = 48ec9889-8ddc-11e4-9efd-da6610fd24da
2014-12-28 02:24:41 9446 [Note] WSREP: Flow-control interval: [16, 16]
2014-12-28 02:24:41 9446 [Note] WSREP: New cluster view: global state: 48ec9889-8ddc-11e4-9efd-da6610fd24da:0, view# 3: Primary, number of nodes: 1, my index: 0, protocol version 3
2014-12-28 02:24:41 9446 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-12-28 02:24:41 9446 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-12-28 02:24:41 9446 [Note] WSREP: Service thread queue flushed.
2014-12-28 02:24:41 9446 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
2014-12-28 02:24:41 9446 [Note] WSREP: Service thread queue flushed.
2014-12-28 02:24:46 9446 [Note] WSREP: cleaning up 41f79045 (tcp://192.168.70.2:4567)
Something is wrong with the permissions of your data directory, as you can see from errors like the following:
rsync: send_files failed to open "/var/lib/mysql/sbtest/sbtest7.ibd": Permission denied (13)
Things I would check:
Are all data files in /var/lib/mysql owned by the mysql user?
Do you have SELinux or AppArmor running? (Either might be misconfigured.)
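A quick check sketch for both points (default datadir path assumed):
# files not owned by mysql:mysql under the datadir (should print nothing)
sudo find /var/lib/mysql \( ! -user mysql -o ! -group mysql \) -ls | head
sudo chown -R mysql:mysql /var/lib/mysql     # only if the find above shows offenders
# is SELinux or AppArmor getting in the way?
getenforce 2>/dev/null                       # SELinux: Enforcing / Permissive / Disabled
sudo aa-status 2>/dev/null | head            # AppArmor profiles, if installed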
Please note that a DONOR using the rsync SST method blocks reads and writes for as long as it is the donor. This can cause downtime.
I suggest looking into Percona XtraBackup as the SST method (although the issues you are having now might be similar).
I am trying to get a MariaDB cluster up and running, but it is not working out for me. Right now I am using MariaDB Galera 5.5.36 on a 64-bit Red Hat ES6 machine. I installed MariaDB through this repo:
[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/5.5-galera/rhel6-amd64/
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1
In the server.conf I have the following in server 1:
[mariadb]
log_error=/var/log/mariadb.log
query_cache_size=0
query_cache_type=0
binlog_format=ROW
default_storage_engine=innodb
innodb_autoinc_lock_mode=2
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.211.133
wsrep_cluster_name='cluster'
wsrep_node_address='192.168.211.132'
wsrep_node_name='cluster1'
wsrep_sst_method=rsync
and on server 2 I have
[mariadb]
log_error=/var/log/mariadb.log
query_cache_size=0
query_cache_type=0
binlog_format=ROW
default_storage_engine=innodb
innodb_autoinc_lock_mode=2
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.211.132
wsrep_cluster_name='cluster'
wsrep_node_address='192.168.211.133'
wsrep_node_name='cluster2'
wsrep_sst_method=rsync
When I start server 1 with sudo service mysql start --wsrep-new-cluster, it starts up just fine; if I open mysql and check the wsrep status, it says everything is up and running, which is good. But when I run sudo service mysql start on the second server, I get the following in the error log:
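(For reference, the wsrep status check referred to above is roughly the following; credentials are placeholders:)
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"          # expect 1 after bootstrapping, 2 once the second node joins
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"   # expect 'Synced'
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_ready';"                 # expect ON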
140609 14:47:55 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140609 14:47:56 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.i5qfm2' --pid-file='/var/lib/mysql/localhost.localdomain-recover.pid'
140609 14:47:57 mysqld_safe WSREP: Recovered position 85448d73-ebe8-11e3-9c20-fbc1995fee11:0
140609 14:47:57 [Note] WSREP: wsrep_start_position var submitted: '85448d73-ebe8-11e3-9c20-fbc1995fee11:0'
140609 14:47:57 [Note] WSREP: Read nil XID from storage engines, skipping position init
140609 14:47:57 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
140609 14:47:57 [Note] WSREP: wsrep_load(): Galera 25.3.2(r170) by Codership Oy <info@codership.com> loaded successfully.
140609 14:47:57 [Note] WSREP: CRC-32C: using hardware acceleration.
140609 14:47:57 [Note] WSREP: Found saved state: 85448d73-ebe8-11e3-9c20-fbc1995fee11:-1
140609 14:47:57 [Note] WSREP: Passing config to GCS: base_host = 192.168.211.133; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5
140609 14:47:57 [Note] WSREP: Assign initial position for certification: 0, protocol version: -1
140609 14:47:57 [Note] WSREP: wsrep_sst_grab()
140609 14:47:57 [Note] WSREP: Start replication
140609 14:47:57 [Note] WSREP: Setting initial position to 85448d73-ebe8-11e3-9c20-fbc1995fee11:0
140609 14:47:57 [Note] WSREP: protonet asio version 0
140609 14:47:57 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
140609 14:47:57 [Note] WSREP: backend: asio
140609 14:47:57 [Note] WSREP: GMCast version 0
140609 14:47:57 [Note] WSREP: (0c085f34-efe5-11e3-9f6b-8bfd1706e2a4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
140609 14:47:57 [Note] WSREP: (0c085f34-efe5-11e3-9f6b-8bfd1706e2a4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
140609 14:47:57 [Note] WSREP: EVS version 0
140609 14:47:57 [Note] WSREP: PC version 0
140609 14:47:57 [Note] WSREP: gcomm: connecting to group 'cluster', peer '192.168.211.132:,192.168.211.134:'
140609 14:48:00 [Warning] WSREP: no nodes coming from prim view, prim not possible
140609 14:48:00 [Note] WSREP: view(view_id(NON_PRIM,0c085f34-efe5-11e3-9f6b-8bfd1706e2a4,1) memb {
0c085f34-efe5-11e3-9f6b-8bfd1706e2a4,0
} joined {
} left {
} partitioned {
})
140609 14:48:01 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50775S), skipping check
140609 14:48:31 [Note] WSREP: view((empty))
140609 14:48:31 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
140609 14:48:31 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
140609 14:48:31 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'cluster' at 'gcomm://192.168.211.132,192.168.211.134': -110 (Connection timed out)
140609 14:48:31 [ERROR] WSREP: gcs connect failed: Connection timed out
140609 14:48:31 [ERROR] WSREP: wsrep::connect() failed: 7
140609 14:48:31 [ERROR] Aborting
140609 14:48:31 [Note] WSREP: Service disconnected.
140609 14:48:32 [Note] WSREP: Some threads may fail to exit.
140609 14:48:32 [Note] /usr/sbin/mysqld: Shutdown complete
140609 14:48:32 mysqld_safe mysqld from pid file /var/lib/mysql/localhost.localdomain.pid ended
I am at a loss as to why the second server cannot detect that a cluster is up and running. The machines can communicate with each other just fine: I can SSH from one to the other and they can ping each other. I have tried deleting the galera cache, downgrading my version of MariaDB Galera, disabling SELinux, running the mysql service as a different user, verifying that the correct ports are open, running the nodes on two VMs on separate computers with different IP addresses, etc. Does anyone have any idea what is going on here? I have been searching for three days and no solution seems to work for me.
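(A minimal connectivity sketch for the Galera ports, 4567 for group communication, 4568 for IST, 4444 for SST, plus 3306, run between the two servers:)
# on the bootstrapped node: is anything actually listening?
netstat -ltnp | grep -E ':(3306|4567|4568|4444)'
# from the joining node: can the group-communication ports be reached?
nc -zv 192.168.211.132 4567
nc -zv 192.168.211.132 4568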
Here is how I fixed my similar issue.
CentOS 7 w/ MariaDB Galera 10.1.
On node2 I saw this:
2016-12-27 15:40:38 140703512762624 [Warning] WSREP: no nodes coming from prim view, prim not possible
After doing some reading, I tried running this on node1.
service mysql start --wsrep-new-cluster
But this failed, and in the logs, I found this...
2016-12-27 15:44:08 140438853814528 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .
So I edited the file /var/lib/mysql/grastate.dat, changing safe_to_bootstrap to 1.
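For reference, grastate.dat is a small plain-text file; after the edit it looks roughly like this (the uuid, seqno, and version values here are placeholders):
# GALERA saved state
version: 2.1
uuid:    xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
seqno:   -1
safe_to_bootstrap: 1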
I was then able to start the Primary node using:
service mysql start --wsrep-new-cluster
Then on the others, I just used
service mysql start
Note: This was in a demo pre-production environment. I promptly broke it again after getting everything to work, by rebooting all the servers at the same time :P, but I knew there were no writes and that the DBs were in sync. If this happens in production, you can use the following to figure out which node to run "new-cluster" on, which is akin to saying "make me primary":
mysqld_safe --wsrep-recover
If this is a production issue, I highly recommend reading this article and making a backup with CloneZilla before throwing commands at the broken nodes:
https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/
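A hedged sketch of that check: run the recovery on each stopped node and compare the reported seqno; the node with the highest seqno is the one to bootstrap. The log path below is an assumption:
# on each node, with mysql stopped:
mysqld_safe --wsrep-recover
grep 'Recovered position' /var/log/mariadb.log | tail -1
# bootstrap the node that reports the highest seqno (the number after the colon)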
The cluster must be started with this command on the primary (bootstrap) node:
galera_new_cluster
After starting the first node, you can start the other nodes in the cluster successfully.
I believe you need to list all the IPs in the wsrep_cluster_address parameter.
wsrep_cluster_address=gcomm://192.168.211.132,192.168.211.133
This should be done on both hosts. By the way, you likely want three nodes rather than two, to avoid split-brain scenarios.
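After both nodes have the full address list and have been restarted, a quick way to verify that they actually formed one cluster (a sketch; credentials are placeholders):
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"          # should report 2 on both hosts
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_incoming_addresses';"    # lists the members' addresses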