Couchbase loses data after cluster node failure simulation - couchbase

I have 3 nodes in Couchbase cluster with number of replicas set to 1.
While performing a multithreaded insert of 1M documents, I restart one of the nodes a couple of times.
The result is that at the end of insert operations, I am missing about 15% of the data.
Any idea how to prevent the data loss?

Firstly, did you fail over the node when it went out of the cluster? Until you fail over, the replicas on the other nodes will not be promoted to active (and hence any replica data will not be accessible).
Secondly, are you checking the return value from your insert operations? If a node is inaccessible (but before a failover), operations will raise an exception (likely "timeout") - you should ensure the application retries the insert.
Thirdly, by default most CRUD operations on Couchbase return as soon as the update has occurred on the master node, for maximum performance. As a consequence, if you do lose a node it's possible that the replica hasn't been written yet - so there would be no replica even if you did perform a failover. To prevent this you can use the observe operation to not report the operation "complete" until a replica node has a copy - see Monitoring Items using observe.
Note that using observe will result in a performance penalty, but this may be an acceptable tradeoff for modifications you particularly care about.
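The retry advice in the second point can be sketched as below. This is a minimal illustration and not Couchbase SDK code: `insert_fn`, `TransientError`, and the backoff parameters are hypothetical stand-ins for your client library's insert call and its timeout exception.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for the SDK's timeout / temporary-failure exception."""


def insert_with_retry(insert_fn, key, doc, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call insert_fn(key, doc), retrying with exponential backoff on TransientError.

    Returns the number of attempts used. Re-raises after the last failed attempt
    so the caller never silently drops a document.
    """
    for attempt in range(1, attempts + 1):
        try:
            insert_fn(key, doc)
            return attempt
        except TransientError:
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x, ... base_delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

Keep in mind that an operation that timed out may still have been applied, so a retried insert can find the key already present; how to treat that case depends on your application.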

Related

Are there *application*-driven reasons to prefer multi-primary topologies over clustering, or vice-versa?

I have an application that currently uses a single primary and I'm looking to do multi-primary by either setting up a reciprocal multi-primary (just two primaries with auto-increment-increment and auto-increment-offset set appropriately) or Clustering-with-a-capital-C. The database is currently MariaDB 10.3, so the clustering would be Galera.
My understanding of multi-primary is that the application would likely require no changes: the application would connect to a single database (doesn't matter which one), and any transaction that needed to obtain any locks would do so locally, any auto-increment values necessary would be generated, and once a COMMIT occurs, that engine would complete the commit and the likelihood of failure-to-replicate to the other node would be very low.
But for Clustering, a COMMIT actually requires that the other node(s) are updated to ensure success, the likelihood of failure during COMMIT (as opposed to during some INSERT/UPDATE/DELETE) is much higher, and therefore the application would really require some automated retry logic to be built into it.
Is the above accurate, or am I overestimating the likelihood of COMMIT-failure in a Clustered deployment, or perhaps even underestimating the likelihood of COMMIT-failure in a multi-primary environment?
From what I've read, it seems that Galera Cluster is a little more graceful about handling nodes leaving and re-joining the Cluster and adding new nodes. Is Galera Cluster really just multi-master with the database engine handling all the finicky setup and management, or is there some major difference between the two?
Honestly, I'm more looking for reassurance that moving to Galera Cluster isn't going to end up being an enormous headache relative to the seemingly "easier" and "safer" move to multi-primary.
By "multi-primary", do you mean that each of the Galera nodes would be accepting writes? (In other contexts, "multi-primary" has a different meaning -- and only one Replica.)
One thing to be aware of: "Critical read".
For example, when a user posts something and it writes to one node, and then that user reads from a different node, he expects his post to show up. See wsrep_sync_wait.
(Elaborating on Bill's comment.) The COMMIT on the original write waits for each other node to say "yes, I can and will store that data", but a read on the other nodes may not immediately "see" the value. Using wsrep_sync_wait just before a SELECT makes sure the write is actually visible to the read.
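A "critical read" wrapper could look roughly like this. It is a sketch that assumes a DB-API 2.0 style cursor connected to a Galera node. The helper name `critical_read` is made up, and resetting the variable to 0 afterwards assumes the session was using the default value.

```python
def critical_read(cursor, query, params=()):
    """Run one SELECT with Galera causality checks enabled for this session only."""
    # Bit value 1 makes READ statements wait until the node has applied
    # everything already committed elsewhere in the cluster.
    cursor.execute("SET SESSION wsrep_sync_wait = 1")
    try:
        cursor.execute(query, params)
        return cursor.fetchall()
    finally:
        # Assumption: the session default is 0 (no waiting), so restore that.
        cursor.execute("SET SESSION wsrep_sync_wait = 0")
```

Enabling the wait only around reads that must see the user's own write keeps the extra latency off ordinary queries.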

Master Slave Replication in Databases

As per How To Set Up Replication in MySQL,
Once the replica instance has been initialized, it creates two
threaded processes. The first, called the IO thread, connects to the
source MySQL instance and reads the binary log events line by line,
and then copies them over to a local file on the replica’s server
called the relay log. The second thread, called the SQL thread, reads
events from the relay log and then applies them to the replica
instance as fast as possible.
Isn't it contradictory to the theory of master-slave database replication in which the master copies data to the slaves?
Reliability. (A mini-history of MySQL's efforts.)
When a write occurs on the Primary, N+1 extra actions occur:
One write to the binlog -- this is to allow for any Replicas that happen to be offline (for any reason); they can come back later and request data from this file. (Also see sync_binlog)
N network writes, one per Replica. These are to get the data to the Replicas ASAP.
Normally, if you want more than a few Replicas, you can "fan out" through several levels, thereby allowing for an unlimited number of Replicas. (10 per level would give you 1000 Replicas in 3 layers.)
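The fan-out arithmetic is just repeated multiplication. A small sketch (the function name is illustrative) that assumes every server at one level feeds the same number of replicas at the next:

```python
def replicas_per_level(fan_out, levels):
    """Replica count at each level below the Primary, with fan_out replicas per server."""
    return [fan_out ** level for level in range(1, levels + 1)]


counts = replicas_per_level(10, 3)
print(counts)       # [10, 100, 1000]
print(sum(counts))  # 1110
```

So the third layer alone holds 1000 Replicas, and the whole tree holds 1110.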
The product called Orchestrator carries this to an extra level -- the binlog is replicated to an extra server and the network traffic occurs from there. This offloads the Primary. (Booking.com uses it to handle literally hundreds of replicas.)
On the Replica's side the two threads were added 20 years ago because of the following scenario:
The Replica is busy doing one query at a time.
It gets busy with some long query (say an ALTER)
Lots of activity backs up on the Primary
The Primary dies.
Now the Replica finishes the Alter, but does not have anything else to work on, so it is very "behind" and will take extra time to "catch up" once the Primary comes back online.
Hence, the 2-thread Replica "helps" keep things in sync, but it is still not fully synchronous.
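The division of labour between the two threads can be imitated with a queue standing in for the relay log. This is a toy model and not how MySQL is implemented: `run_replica`, `binlog_events`, and `apply` are names invented for the illustration.

```python
import queue
import threading


def run_replica(binlog_events, apply):
    """Simulate a replica: an I/O thread fills the relay log while an SQL thread drains it."""
    relay_log = queue.Queue()
    done = object()  # sentinel: the source has no more events to send

    def io_thread():
        for event in binlog_events:   # cheap: just copy each event locally
            relay_log.put(event)
        relay_log.put(done)

    def sql_thread():
        while True:
            event = relay_log.get()
            if event is done:
                break
            apply(event)              # possibly slow, e.g. a long ALTER

    threads = [threading.Thread(target=io_thread), threading.Thread(target=sql_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the I/O side never waits for `apply`, the events are already stored locally even if the Primary disappears while the SQL side is still busy.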
Later there was "semi-synchronous" replication and multiple SQL threads in the Replica (still a single I/O thread).
Finally, InnoDB Cluster and Galera became available to provide [effectively] synchronous replication. But they come with other costs.
"master-slave database replication in which the master copies data to the slaves" - it's just a concept - data from a leader is copied to followers. There are many options how this could be done. Some of those are the write ahead log replication, blocks replication, rows replication.
Another interesting approach is to use a replication system completely separate from the storage. An example of this would be Bucardo, a replication system for PostgreSQL. In that case neither the leader nor the follower actually does the replication work.

Is there any concept of load balancing in MySQL master-master architecture?

I am running a MySQL 5.5 Master-Slave setup. To avoid too many hits on my master server, I am thinking of adding one or maybe more MySQL servers; incoming requests would first hit HAProxy, which forwards them either round-robin or by any other scheduling algorithm defined in HAProxy. So the setup will be like -
APP -> API Gateway/Server -> HAProxy -> Master Server1/Master Server2
So what can be pros and cons to this setup ?
Replication in MySQL is asynchronous by default, so you can't always assume that the replicas are in sync with their source.
If you intend to use your load-balancer to split writes over the two master instances, you could get into trouble with that because of MySQL's asynchronous replication.
Say you commit a row on master1 to a table that has a unique key. Then you commit a row with the same unique value to the same table on master2, before the change on master1 has been applied through replication. Both servers allowed the row to be committed, because as far as they knew, it did not violate the unique constraint. But then as replication tries to apply the change on each server, those changes do conflict with the row committed. This is called split-brain, and it's incredibly difficult to recover from.
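The conflict described above can be reproduced with a toy model. This simulates the timing problem only; the `Master` class and its methods are invented for the illustration and are not MySQL code.

```python
class Master:
    def __init__(self, name):
        self.name = name
        self.rows = {}      # unique key -> value
        self.outbox = []    # committed changes not yet replicated

    def commit(self, key, value):
        """Local commit: only this server's own rows are checked for uniqueness."""
        if key in self.rows:
            raise KeyError(f"duplicate key {key!r} on {self.name}")
        self.rows[key] = value
        self.outbox.append((key, value))

    def replicate_to(self, other):
        """Apply queued changes on the peer; return those that hit a duplicate key."""
        conflicts = []
        for key, value in self.outbox:
            if key in other.rows:
                conflicts.append((key, value))
            else:
                other.rows[key] = value
        self.outbox.clear()
        return conflicts


m1, m2 = Master("master1"), Master("master2")
m1.commit("alice@example.com", "row from master1")   # accepted locally
m2.commit("alice@example.com", "row from master2")   # also accepted locally
print(m1.replicate_to(m2))  # [('alice@example.com', 'row from master1')]
print(m2.replicate_to(m1))  # [('alice@example.com', 'row from master2')]
```

Both commits succeeded, both replication attempts failed, and the two servers now hold different rows for the same unique value.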
If your load-balancer randomly sends some read queries to another instance, they might not return data that you just committed on the other instance. This is called replication lag.
This may or may not be a problem for your app, but it's likely that in your app, at least some of the queries require strong consistency, i.e. reading outdated results is not permitted. Other cases even with the same app may be more tolerant of some replication lag.
I wrote a presentation some years ago about splitting queries between source and replica MySQL instances: https://www.percona.com/sites/default/files/presentations/Read%20Write%20Split.pdf. The presentation goes into more details about the different types of tolerance for replication lag.
Newer MySQL versions offer a more sophisticated solution for all of these problems. It's called Group Replication (available since MySQL 5.7.17 and further developed in 8.0), and it does its best to ensure that all instances are in sync all the time, so you don't have the risk of reading stale data or creating write conflicts. The downside of Group Replication is that to ensure no replication lag occurs, it may need to constrain your transaction throughput. In other words, COMMITs may be blocked until the other instances in the replication cluster respond.
Read more about Group Replication here: https://dev.mysql.com/doc/refman/8.0/en/group-replication.html
P.S.: Whichever solution you decide to pursue, I recommend you do upgrade your version of MySQL. MySQL 5.5 passed its end-of-life in 2018, so it will no longer get updates even for security flaws.

Writing into multiple MySQL databases async

I am using AWS RDS, so database replication between regions is impossible.
My application is written in PHP and deployed in all regions; I am looking for a fast and reliable way to achieve that.
I am going to set these on my MySQL connections:
SET @@auto_increment_increment = NUMBER_OF_WRITEABLE_DATABASES;
SET @@auto_increment_offset = REGION_ID;
so AI pk's will be unique all over regions.
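To see why these two settings keep the keys apart, here is a small sketch of the sequences they produce. It assumes three writable regions numbered 1 to 3 and that each offset does not exceed the increment; the function name is illustrative.

```python
from itertools import count, islice


def auto_increment_ids(offset, increment):
    """Yield offset, offset + increment, offset + 2 * increment, ..."""
    return count(start=offset, step=increment)


regions = (1, 2, 3)
ids = {r: list(islice(auto_increment_ids(r, len(regions)), 4)) for r in regions}
print(ids)  # {1: [1, 4, 7, 10], 2: [2, 5, 8, 11], 3: [3, 6, 9, 12]}
```

Each region draws from its own residue class modulo the number of writable databases, so no two regions can generate the same value.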
And my current plan is keeping a query log table with fields => id,queries,status,user_id. It will log all insert,update,delete queries into queries field in same page load.
Status Codes:
Status 0 => not executed
Status 1 => successfully executed on all regions
Status 2 => failed
Status 3 => failed with affected rows not match
Example Row:
id=>1
queries=>
INSERT INTO PROFILES VALUES (1,{USER_ID},'Username','Email')##SEPERATOR##AFFECTED_COUNT
UPDATE USERS SET last_modified='2012-12...' where id={USER_ID}##SEPERATOR##AFFECTED_COUNT
status=0
user_id=>{USER_ID}
and there will be a daemon which reads records whose status != 1 and processes them on all regions without committing; once all run without error it will commit, or roll back in case of error.
That is what I came up with and am going to use.
My question: is there a more decent/tested approach to this scenario, or is there any problem with my approach?
thanks in advance
My initial thought is that you are going down the wrong path if you are trying to use RDS as a solution to enforce unique record ID's across multiple regions. I would think you might want to rethink your actual need for uniqueness across regions or enforce uniqueness using multiple columns (i.e. an autoincrement plus a region identifier). That could be read and put into some eventually consistent data store for read purposes.
You're making a commendable effort, but as the other commenters have stated, your solution isn't viable, for a number of reasons.
You don't really want to use auto_increment_offset and auto_increment_increment at the session level. You want to set those at the server level. If RDS won't let you do that, this is another reason why RDS is probably not the best solution.
If I came out and suggested that you deploy a global network of MySQL servers (EC2, not RDS) in a multi-master ring, where data replicates 1 => 2 => 3 => 4 => 1 and each server ignores incoming replication messages with its own server id, my fellow MySQL DBAs would accuse me of having lost my mind and setting you up for a difficult-to-manage situation; however, I am convinced that this would be a much easier solution than what you have proposed, because at least, then, the data would be changing around the world in pretty much the same order in which it actually changed -- which would reduce the likelihood of conflicting updates originating from multiple locations. MySQL replication is asynchronous, in the sense that server 1 does not wait for a transaction to be committed on server 2 before returning success to the client (indicating that the transaction has committed), but don't confuse that fact with the fact that it is sequential -- transactions are replicated on each server in the order in which they were committed. (New options in MySQL 5.6 allow some exceptions to this by using parallel replication threads, but that isn't significant to this discussion).
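The ring behaviour, where an event travels around until it reaches the server that originated it, can be sketched as follows. This models the routing only; `propagate` and the list-based ring are illustrative assumptions, not MySQL internals.

```python
def propagate(origin, ring):
    """Return the servers that apply an event committed on `origin`, in arrival order.

    `ring` lists server ids in replication order; the last replicates to the first.
    """
    applied = []
    position = (ring.index(origin) + 1) % len(ring)
    while ring[position] != origin:     # a server ignores events carrying its own id
        applied.append(ring[position])
        position = (position + 1) % len(ring)
    return applied


print(propagate(1, [1, 2, 3, 4]))  # [2, 3, 4]
print(propagate(3, [1, 2, 3, 4]))  # [4, 1, 2]
```

Every other server sees the event exactly once, and always after the server immediately upstream of it has applied it.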
Since you have devised a scheme for avoiding conflicting auto-increment values, your bigger problems are likely to come from updates and deletes. In the scenario I just described, if server 2 deleted a record and server 4 deleted the same record at the same time, then server 4 would stop replicating incoming events when it received the delete from server 2, because the "rows affected" would have been different. Your scenario would similarly fail. The difference is that, with actual MySQL replication, nothing after the conflicting event is applied, so until you resolved that conflict, at least your data would not diverge any further into inconsistency, because of the sequential nature discussed above and the fact that MySQL replication completely stops whenever a conflict is encountered. In a ring of master servers, the server that has stopped replicating continues collecting a log of replication events from the upstream systems, but execution halts and the data on that server is frozen unless changed locally until the conflict is resolved and replication restarted.
Note also that in your scenario, you need to preserve "from" and "to" values for each column on updates, because you can't roll anything back unless you know what it rolls back to.
That being noted, a rollback needs to occur in real-time, not later. If I transfer money between two bank accounts, and for some reason that transfer needs to roll back, I need to see that while I'm using the bank's web site -- the bank can't roll that transaction back in the middle of the night just because one of their servers has a different balance in my bank account.
Here's a thought: In your scenario, if the account I was transferring "to" was consistent among all the servers, but the account I was transferring "from" was not, then I wonder... would your setup roll back the withdrawal from the "from" account, but leave the deposit in the "to" account? I think it might.
Keep in mind that you are limited by the CAP theorem. No system can be simultaneously consistent, available, and tolerant of network partitions (isolation among the nodes). At best, you can pick any two.
With that thought, the question I have is this: why do all of the nodes in your global system need to be synchronized? If the main reason is performance, consider the possibility of deploying a single global master server, with read replicas distributed among the regions. Write your application with two pools of database connection threads so that most SELECT queries go to the local read replica, while INSERT, DELETE, UPDATE, and CALL (stored procedures that update data), are sent to the global master server. Your biggest worry, then, becomes the fact that you only have eventual consistency on the read replicas. With properly-sized servers and well-written queries, this is very fast (subject to the laws of physics for global travel of optical and electrical signals) but it is not instantaneous. What you have to do to accomplish this is for sessions that have recently made changes to the database, their reads may need to hit the global master -- if you place an order, you need to see the order immediately, so the master might be the best place to look, right away. Later, looking at the local replica will work. You're still out of scope for RDS with this, because of the cross-regional issue... but MySQL on EC2 is a good fit.
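The "two pools" idea, including sending a session's reads to the master for a short while after it writes, might be sketched like this. The `Router` class, the five-second window, and the verb list are assumptions made for the illustration; a real application would pick the window from its observed replication lag.

```python
import time

WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "CALL")


class Router:
    def __init__(self, sticky_seconds=5.0, clock=time.monotonic):
        self.sticky_seconds = sticky_seconds  # hypothetical read-your-writes window
        self.clock = clock
        self.last_write = {}                  # session id -> time of most recent write

    def route(self, session_id, sql):
        """Return "master" or "replica" for this statement."""
        now = self.clock()
        if sql.lstrip().upper().startswith(WRITE_VERBS):
            self.last_write[session_id] = now
            return "master"
        wrote_at = self.last_write.get(session_id)
        if wrote_at is not None and now - wrote_at < self.sticky_seconds:
            return "master"   # the local replica may not have this session's change yet
        return "replica"
```

A session that has just placed an order reads it back from the master, while every other session keeps reading from its local replica.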
Read replicas impose a very small load on the master, but even this load can be mitigated by connecting a single read replica to the master and then connecting the downstream read replicas to that intermediate server.
Setting slave_compressed_protocol = 1 on the masters and the replicas will enable the machines to use compressed connections for transferring the replication events. I have found this to be anywhere from 3:1 to 10:1 depending on the nature of the data being replicated and the delay of compressing and decompressing the data seems insignificant.
Additionally, you could set up a second master, adjacent to the primary master (perhaps in a different A/Z), link those two servers with master-master replication, chain the read replicas to the 2nd master, use auto increment increment and offsets appropriately, but do not write to or read from the second master under normal conditions. Why would you do this? This way, you have a 2nd global master that could be placed into service immediately in case of failure of the primary master by redirecting your application to access it.
Of course, the nature of your application plays a large factor in how much global integration is actually required. Solving this problem will require you to rethink how the application works, to determine whether architectural changes are needed.
As a DBA, I don't like some of the restrictions and flexibility constraints that RDS imposes on me. All I really get in return for the loss-of-control is a relative ease of backups and point-in-time restoration... which I like... but, to me, these don't make up for the restrictions.
Footnote: In the 3rd paragraph, I said "transactions are replicated on each server in the order in which they were committed." But that doesn't necessarily mean in the real-world wall-clock actual-order in which they were committed... it actually means the order in which they were committed to each server relative to the other transactions being committed by that server... so a transaction on Server #1 that actually committed before a different transaction on Server #3 might arrive at server #4 after the transaction from #3 instead of before it, depending on how long the transaction took to propagate through server #2 and be committed on server #3. However, this is still "true enough" in principle, because if the transaction on #1 is perceived at server #3 as conflicting with whatever happened on #3, it will not actually replicate to #4 because #3 will stop replicating.

Does the MySQL NDB Cluster consider node distance? Will it use the replicates if they are nearer?

I'm building a very small NDB cluster with only 3 machines. This means that machine 1 will serve as MGM Server, MySQL Server, and NDB data node all at once. The database is only 7 GB so I plan to replicate each node at least once. Now, since a query might end up using data that is cached in the NDB node on machine one, even if that node isn't the primary source for that data, access would be much faster (for obvious reasons).
Does the NDB cluster work like that? Every example I see has at least 5 machines. The manual doesn't seem to mention how to handle node differences like this one.
There are a couple of questions here :
Availability / NoOfReplicas
MySQL Cluster can give high availability when data is replicated across 2 or more data node processes. This requires that the NoOfReplicas configuration parameter is set to 2 or greater. With NoOfReplicas=1, each row is stored in only one data node, and a data node failure would mean that some data is unavailable and therefore the database as a whole is unavailable.
Number of machines / hosts
For HA configurations with NoOfReplicas=2, there should be at least 3 separate hosts. 1 is needed for each of the data node processes, which has a copy of all of the data. A third is needed to act as an 'arbitrator' when communication between the 2 data node processes fails. This ensures that only one of the data nodes continues to accept write transactions, and avoids data divergence (split brain). With only two hosts, the cluster will only be resilient to the failure of one of the hosts; if the other host fails instead, the whole cluster will fail. The arbitration role is very lightweight, so this third machine can be used for almost any other task as well.
Data locality
In a 2 node configuration with NoOfReplicas=2, each data node process stores all of the data. However, this does not mean that only one data node process is used to read/write data. Both processes are involved with writes (as they must maintain copies), and generally, either process could be involved in a read.
Some work to improve read locality in a 2-node configuration is under consideration, but nothing is concrete.
This means that when MySQLD (or another NdbApi client) is colocated with one of the two data nodes, there will still be quite a lot of communication with the other data node.