AWS Aurora MySQL Blue/Green Deployment replication failure - mysql

I wanted to use the new fully managed Blue/Green deployments to upgrade Aurora 1 databases (MySQL 5.6) to Aurora 2 (MySQL 5.7). Although it worked great on one pre-production environment, replication fails on other environments, including production. The replication fails with errors like:
Read Replica Replication Error - SQLError: 1032, reason: Could not execute Delete_rows event on table my_schema.my_table; Can't find record in 'my_table', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin-changelog.000122, end_log_pos 2040
On other instances there were duplicate entries for the primary key (error code 1062). I tried both ROW and MIXED as binlog_format. AWS support's recommendation is to solve those problems manually, which seems impractical. For debugging purposes, I tried to read the binlog with the mysqlbinlog utility. This led to the following error:
ERROR: Could not construct log event object: Found invalid event in binary log
Did anybody encounter similar issues? Is there a way to get more insight into these errors and solve them?
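One way to get a quicker overview of many such failures is to pull the structured fields out of each error line, so that the affected tables and binlog positions can be compared across environments. The following is only a sketch: the pattern is modeled on the single error message quoted above, the function name is made up for illustration, and other MySQL or Aurora versions may word the message differently.

```python
import re

# Pattern modeled on the error text quoted in the question (an assumption:
# other server versions may phrase or order the fields differently).
ERROR_RE = re.compile(
    r"Could not execute (?P<event>\w+) event on table (?P<table>[\w.]+);"
    r".*?Error_code: (?P<code>\d+);"
    r".*?master log (?P<log>\S+), end_log_pos (?P<pos>\d+)"
)


def parse_replication_error(message):
    """Return the event type, table, error code and binlog position, or None."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    return {
        "event": match.group("event"),
        "table": match.group("table"),
        "code": int(match.group("code")),
        "log": match.group("log"),
        "pos": int(match.group("pos")),
    }
```

Grouping the parsed results by table and error code shows whether the failures are concentrated in a few tables or spread across the schema.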

Related

Replication failed with error code 1062 using Google Database Migration Service on big database (~800G)

I'm trying to create a MySQL 5.7.35 read-only replica on GCP from an external origin. The database is enormous, with approximately 800G of data.
I have already adjusted the definer on the triggers, views and functions in a way that GCP accepts (root@%), and therefore the full dump that the Database Migration Service makes worked. I also got the replication working with the schema of this database (no data).
So far I have made just one attempt with data. On this attempt the full dump was successful (it took 2 days and 10 hours), but replication failed some time after it started, with the following error:
2021-09-05T06:09:33.293123Z 2 [ERROR] Slave SQL for channel '': Could not execute Write_rows event on table pacsdb.content_item; Duplicate entry '1441957' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000005, end_log_pos 78621021, Error_code: 1062
Selecting this row on the replica returned the same data as on the origin (the row was already there).
Since I can't run stop slave, set the skip counter, and start slave (or anything similar) on GCP, I have to figure out why this is happening.
My next step would be to try making the dump manually using the flags that Google recommends.
Has anyone had a similar problem, or does anyone have a clue why this is happening?
Any tips are appreciated, thanks!
Activating the consistency warnings and GTID-based replication should work. Information on replication with Global Transaction Identifiers in MySQL 5.7 is available here [1].
[1] - https://dev.mysql.com/doc/refman/5.7/en/replication-gtids.html
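For reference, the MySQL 5.7 manual describes enabling GTIDs on a running server as a sequence of small steps: first the consistency check in WARN mode, then ON, then gtid_mode moved through its intermediate values. The sketch below only generates that sequence from the current settings. The function name is hypothetical, and on a managed service these variables may have to be changed through instance flags or parameter groups rather than SET statements.

```python
# Value orderings used by the online procedure in the MySQL 5.7 manual.
CONSISTENCY_ORDER = ["OFF", "WARN", "ON"]
MODE_ORDER = ["OFF", "OFF_PERMISSIVE", "ON_PERMISSIVE", "ON"]


def remaining_gtid_steps(enforce_gtid_consistency, gtid_mode):
    """List the SET statements still needed, in order, from the current values."""
    steps = []
    start = CONSISTENCY_ORDER.index(enforce_gtid_consistency.upper())
    for value in CONSISTENCY_ORDER[start + 1:]:
        # After WARN, check the error log for GTID-unsafe statements before ON.
        steps.append(f"SET @@GLOBAL.ENFORCE_GTID_CONSISTENCY = {value};")
    start = MODE_ORDER.index(gtid_mode.upper())
    for value in MODE_ORDER[start + 1:]:
        # Before the final ON, the manual asks to wait until the status variable
        # Ongoing_anonymous_transaction_count is 0 on every server.
        steps.append(f"SET @@GLOBAL.GTID_MODE = {value};")
    return steps
```

Calling remaining_gtid_steps("OFF", "OFF") yields all five statements, and each one should be applied to every server in the topology before moving to the next.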

Cannot start debezium MySQL connector due to Error code 1236

When I check the status of my Debezium connector via Kafka Connect's REST API, I see this error message for the connector:
org.apache.kafka.connect.errors.ConnectException: The slave is
connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the
master has purged binary logs containing GTIDs that the slave
requires. Error code: 1236; SQLSTATE: HY000.\n\tat
io.debezium.connector.mysql.AbstractReader.wrap(AbstractReader.java:230)\n\tat
io.debezium.connector.mysql.AbstractReader.failed(AbstractReader.java:197)\n\tat
io.debezium.connector.mysql.BinlogReader$ReaderThreadLifecycleListener.onCommunicationFailure(BinlogReader.java:997)\n\tat
com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:950)\n\tat
com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:580)\n\tat
com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:825)\n\tat
java.lang.Thread.run(Thread.java:748)\nCaused by:
com.github.shyiko.mysql.binlog.network.ServerException: The slave is
connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the
master has purged binary logs containing GTIDs that the slave
requires.\n\tat
com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:914)\n\t...
3 more\n
Is this an issue with how I am configuring my Debezium connector or an issue with MySQL? What's crazy is that even when I set the option snapshot.mode to never, this error is still thrown! According to the documentation, when snapshot.mode is set to either never or when_needed it should not require the GTID, so I am super confused as to what is happening.
The problem is that Debezium was probably down for some time and some of the transactions it has not seen are no longer available on the server.
That could be an issue with the wrong offsets for the connector.
So I've deleted the connector, deleted all related Kafka topics (like schema history, etc), and cleaned the offsets using the following guide https://debezium.io/documentation/faq/#how_to_remove_committed_offsets_for_a_connector
And it helped! After re-creation - the connector works as expected now.
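Before wiping the offsets, it can be worth confirming that this really is the cause. Error 1236 in this form means the server has purged transactions that the client has not yet received, so comparing the server's gtid_purged value with the GTID set stored in the connector's offset should show a gap. The following is a rough sketch of that comparison; the function names are made up for illustration, and the two input strings are assumed to have been fetched separately (for example with SELECT @@GLOBAL.gtid_purged on the server).

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(first, last), ...]}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = result.setdefault(uuid.lower(), [])
        for interval in intervals:
            first, _, last = interval.partition("-")
            ranges.append((int(first), int(last or first)))
    return result


def subtract(ranges, covered):
    """Return the parts of `ranges` not covered by `covered` (inclusive intervals)."""
    missing = []
    for start, end in ranges:
        cursor = start
        for c_start, c_end in sorted(covered):
            if c_end < cursor:
                continue
            if c_start > end:
                break
            if c_start > cursor:
                missing.append((cursor, c_start - 1))
            cursor = max(cursor, c_end + 1)
        if cursor <= end:
            missing.append((cursor, end))
    return missing


def purged_but_needed(gtid_purged, connector_gtids):
    """Transactions the server has purged that the connector never saw."""
    purged = parse_gtid_set(gtid_purged)
    seen = parse_gtid_set(connector_gtids)
    gaps = {}
    for uuid, ranges in purged.items():
        missing = subtract(ranges, seen.get(uuid, []))
        if missing:
            gaps[uuid] = missing
    return gaps
```

An empty result means nothing the connector still needs has been purged, in which case the offsets themselves are the more likely culprit.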

Communication link failure: 1047 WSREP has not yet prepared node for application use in

We had a three-node cluster with MariaDB 10.4. We had an outage and the servers all rebooted with one having an irrecoverable network issue at the time.
We set up another server and added it to the cluster as a third member later.
However, ever since that, we have constantly been getting this error every now and then.
*3287799 FastCGI sent in stderr: "PHP message: An Error occurred while handling another error:
PDOException: SQLSTATE[08S01]: Communication link failure: 1047 WSREP has not yet prepared node for application use in /var/....yii2/db/Command.php:1293
In order to fix this issue, we shut down all three nodes one by one and then re-initialized the cluster, even with a new cluster name and all.
The first one was started with "galera_new_cluster" and the remaining two were added to this cluster. However, we still kept getting the same error intermittently.
The workaround at "mariadb galera - Error when a node shutdown ERROR 1047 WSREP has not yet prepared node for application use" was followed, but that didn't do anything, as expected.
Next, what we did is set up a single fresh server and installed the new 10.5.X MariaDB server on it. Took backup from the old cluster using mariabackup and restored it onto this new single server.
This single server was set up as a new cluster with fresh details and everything. We wanted to run it as a single-node cluster to see whether the error still persisted. Oddly enough, the error is still there and it occurs every half an hour or so.
Has anyone got any clue what could be the reason for this weird issue we're facing? Currently, we don't know what exactly is the issue which is why we're facing a hard time solving it.
Any help would be greatly appreciated.
Update:
We turned off Galera on this single-node cluster and ran it as a simple stand-alone MariaDB server. However, we still got the same errors in our web server's logs. This is bonkers.
Any idea? Anyone?

aws DMS replicate-changes-only error

I have a prod AWS Aurora DB and I want to replicate changes to a test MySQL DB (the schema is the same; Aurora is based on MySQL).
I am using AWS DMS for this.
When performing full replication for certain tables, the replication works fine.
When I want to perform replicate-changes-only, the replication fails.
I've set binlog_checksum=NONE and binlog_format=ROW in the parameter group.
The error I am receiving while running is:
Last Error The task stopped abnormally Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE
Last Error Task 'task-id' was suspended due to 6 successive unexpected failures Stop Reason FATAL_ERROR Error Level FATAL
Loading a snapshot to the test db isn't an option.
I just want to replicate the changes between specific tables.
Thanks in advance.
I had the same error; the task always stopped about 10 minutes after starting. Adding more verbose logs didn't show more information, but changing the task configuration helped, specifically the parameter MaxFullLoadSubTasks.
By default the value is "MaxFullLoadSubTasks": 8. I changed it to "MaxFullLoadSubTasks": 1.
It is slower but it's working for now. You may be able to increase it a bit to be quicker without having the same error.
You can modify the task configuration by first copying the task json settings you will find under DMS > TASK > overview, then changing the value and saving it to a file and then:
aws dms modify-replication-task --replication-task-arn <TASK_ARN_ID> --replication-task-settings file:///path/to/your/task_config.json
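Editing the copied JSON by hand is error-prone, so the change can also be scripted. This is a minimal sketch, assuming the settings copied from the console have MaxFullLoadSubTasks under the FullLoadSettings section, as in the DMS task settings format; the function names and file paths are placeholders.

```python
import json


def set_max_full_load_subtasks(settings, value=1):
    """Return a copy of the task settings with MaxFullLoadSubTasks changed."""
    updated = json.loads(json.dumps(settings))  # deep copy via JSON round trip
    updated.setdefault("FullLoadSettings", {})["MaxFullLoadSubTasks"] = value
    return updated


def rewrite_settings_file(src_path, dst_path, value=1):
    """Read the copied task settings, patch them, and write the result."""
    with open(src_path) as src:
        settings = json.load(src)
    with open(dst_path, "w") as dst:
        json.dump(set_max_full_load_subtasks(settings, value), dst, indent=2)
```

The file written by rewrite_settings_file can then be passed to the modify-replication-task command above via the file:// path.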

MySQL Cluster ERROR 1296 (HY000): Got error 157 'Unknown error code' from NDBCLUSTER

Today my datacenter had a breaker fail which resulted in my servers losing power. I'm running a 4 node MySQL cluster. I restarted the cluster, first the management nodes, then the data nodes, then after the data nodes were running I started the SQL nodes. I then checked the cluster with ndb_mgm -e SHOW. Everything seemed fine until I tried running a query. I got this error,
ERROR 1296 (HY000): Got error 157 'Unknown error code' from NDBCLUSTER
I checked the MySQL logs and could not find any errors. I then tried a full shutdown and restart of the MySQL cluster and checked the config between the shutdown and start. Everything seemed to check out. I then ran a query on another database using the NDBCLUSTER engine. The query was successful. I've tried searching Google but no one seems to have any answers that help. I've checked the config, I've made sure ndbd is running on the data nodes, etc. The other databases seem to be working fine except this one. I have a backup of the database but I would much prefer to recover the database if possible.
If anyone has any suggestion or ideas, it would be greatly appreciated.
Thanks in advance.
Error 157 is actually 'could not connect to storage engine' and the fact that MySQL fails to report that error correctly is a bug: http://bugs.mysql.com/bug.php?id=44817
The case described in that bug mentions that you get the error when you try to query a table in NDB when the cluster is still down.
So I'm just guessing, but I would conclude that your cluster is not started. Either you missed starting one of the nodes, or else something went wrong starting one of the nodes.
Check that the MySQL server is really connected to the NDB storage.
Run the following from a MySQL server that should be connected to NDB:
SHOW GLOBAL STATUS LIKE 'Ndb_cluster_node_id';
Is the answer > 0?
SHOW GLOBAL STATUS LIKE 'Ndb_number_of_data_nodes';
Is the answer > 0 ?
If not, then the MySQL server is not connected, and I would recommend you check your firewall and /etc/hosts table and make sure you don't have a line like:
127.0.0.1 localhost ..
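If there are several SQL nodes to check, the two status queries above can be evaluated in a small script. The sketch below only covers the decision itself: it assumes the rows from SHOW GLOBAL STATUS have already been fetched into a dictionary by whatever client library is in use, and the function name is made up for illustration.

```python
def ndb_connected(status):
    """status maps SHOW GLOBAL STATUS variable names to their values."""
    node_id = int(status.get("Ndb_cluster_node_id", 0))
    data_nodes = int(status.get("Ndb_number_of_data_nodes", 0))
    # Both values must be greater than 0 for the SQL node to count as connected.
    return node_id > 0 and data_nodes > 0
```

A False result for any SQL node points to the firewall and /etc/hosts checks described above.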
Best regards
Johan