I had a 4-node Couchbase cluster with 3 buckets, each with 1 replica. However, when one of my nodes went down, part of my dataset became inaccessible. I thought this might be because I have an even number of nodes, i.e. 4 (instead of, say, 3 or 5), so I failed over 1 node. I then proceeded to rebalance the cluster, at which point it got stuck. The only thing I can find in the logs is: Bucket "staging" rebalance does not seem to be swap rebalance. Any idea how to recover from this?
In my desperation I also tried changing the number of replicas of different buckets and then performing a rebalance. Nothing worked. This has happened once before as well; that time I had to dump my whole database and load it into a brand new cluster, because I couldn't even back up my database. This time that path is not an option, since the data is critical and uptime is also important.
Couchbase support pointed to a bug where rebalancing can hang if there are empty vBuckets. According to them this was fixed in 2.0, but it is not!
The workaround is to populate each bucket with a minimum of 2048 items with a short time to live (TTL >= (10 minutes per upgrade + (2 x rebalance_time)) x num_nodes) so that all vBuckets have something in them. We populated all buckets this way and were then able to restart the rebalance, which completed fine.
This works for Couchbase 3.0.
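As a worked example of the TTL bound in the workaround, here is a small sketch. The 15-minute rebalance time and the 4 nodes are made-up inputs for illustration (measure your own rebalance time), and the function name is hypothetical; only the formula itself comes from the workaround.

```python
def min_ttl_minutes(rebalance_minutes, num_nodes, upgrade_minutes=10.0):
    """Lower bound on the TTL of the filler items, in minutes.

    Implements: TTL >= (10 minutes per upgrade + 2 x rebalance_time) x num_nodes
    """
    return (upgrade_minutes + 2 * rebalance_minutes) * num_nodes


# Hypothetical inputs: a 15-minute rebalance on a 4-node cluster.
ttl_minutes = min_ttl_minutes(rebalance_minutes=15, num_nodes=4)
ttl_seconds = int(ttl_minutes * 60)

print(ttl_minutes)  # 160.0
print(ttl_seconds)  # 9600
```

With those inputs the filler items would need to live for at least 160 minutes, so that they outlast the whole procedure and then expire on their own.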
Reference: http://www.pythian.com/blog/couchbase-rebalance-freeze-issue/
I'm trying to figure out what could account for the very large performance difference between my dev environment (a 5-year-old laptop) and our stage server (Azure cloud). The table in question is a log table of web service requests for a service that processes XML. One of the columns in the table is the XML passed to the web service.
On my local computer it basically doesn't matter how many rows are in the table; performance is great. On the deployed server if there are more than a couple hundred rows then performance starts tanking quickly. A "select count(*)" on this table when it has 2000 rows in it will take 0.0017 seconds locally but close to 20 on the server. Even a simple insert of a new row takes a significant amount of time; on the order of whole seconds.
While researching the problem I found this article: "explanation of MySQL block performance". That makes sense to me and I'd be happy to implement the 1-to-1 solution, but I don't want to do it until I understand why it's working fine locally and tanking on the server.
Are there some MySQL settings or variables I can check to find the differences? I'd really like to get my local computer to show the same performance issue as the deployed server so I can validate that the fix will work.
EDIT:
The create table queries are identical. The MySQL versions are 5.7.23 and 5.7.22. I did notice that the buffer pool is 16x bigger on my local machine. I'm going to try to get the server updated to the setting my local machine has and see if that resolves the issue.
The solution was updating the buffer pool size like Rick suggested.
I have the following problem.
Using REST, I am getting binary content (BLOBs) from a MySQL database via a Node.js Express app.
All works fine, but I am having issues scaling the solution.
I increased the number of Node.js instances to 3: they are running on ports 4000, 4001 and 4002.
On the same machine I have Nginx installed and configured to load balance between my 3 instances.
I am using Apache Bench to do some perf testing.
Please see attached pic.
Assuming I have a dummy GET REST that goes to the db, reads the blob (roughly 600KB in size) and returns it back (all http), I am making 300 simultaneous calls. I would have thought that using nginx to distribute the requests would make it faster, but it does not.
Why is this happening?
I am assuming it has to do with MySQL?
My Node.js app is using a connection pool with a limit set to 100 connections. What should be the relation between this value and the max connections value in MySQL? If I increase the connection pool to a higher number of connections, I get worse results.
Any suggestion on how to scale?
Thanks!
"300 simultaneous" is folly. No one (today) has the resources to effectively do more than a few dozen of anything.
4 CPU cores -- If you go much beyond 4 threads, they will be stumbling over each other trying to get CPU time.
1 network -- Have you checked to see whether your big blobs are using all the bandwidth, thereby being the bottleneck?
1 I/O channel -- Again, lots of data could be filling up the pathway to disk.
(This math is not quite right, but it makes a point...) You cannot effectively run any faster than what you can get from 4+1+1 "simultaneous" connections. (In reality, you may be able to, but not 300!)
The typical benchmarks try to find how many "connections" (or whatever) leads to the system keeling over. Those hard-to-read screenshots say about 7 per second is the limit.
I also quibble with the word "simultaneous". The only thing close to "simultaneous" (in your system) is the ability to use 4 cores "simultaneously". Every other metric involves sharing of resources. Based on what you say, ...
If you start about 7 each second, some resource will be topped out, but each request will be fast (perhaps less than a second)
If you start 300 all at once, they will stumble over each other, some of them taking perhaps minutes to finish.
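To put rough numbers on those two cases, here is a minimal sketch using Little's law (requests in flight = throughput x average time per request). It assumes the roughly 7 requests per second read off the screenshots is a fixed ceiling, and that the load generator keeps N requests in flight the whole time; the function name is made up for illustration.

```python
def average_seconds_per_request(requests_in_flight, throughput_per_sec):
    """Little's law rearranged: W = L / lambda."""
    return requests_in_flight / throughput_per_sec


# Assumed sustained capacity, taken from the "about 7 per second" estimate.
CAPACITY_PER_SEC = 7.0

for in_flight in (7, 10, 300):
    seconds = average_seconds_per_request(in_flight, CAPACITY_PER_SEC)
    print(in_flight, round(seconds, 1))
# 7 1.0
# 10 1.4
# 300 42.9
```

In other words, under these assumptions the throughput stays the same no matter how many requests you pile on; only the time each request spends waiting grows.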
There are two interesting metrics:
How many per second you can sustain. (Perhaps 7/sec)
How long the average (and, perhaps, the 95th percentile) takes.
Try 10 "simultaneous" connections and report back. Try 7. Try some other small numbers like those.
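When you report back, the two metrics can be computed from the per-request timings of each run. A minimal sketch, assuming you have the latencies as a list of seconds plus the wall-clock duration of the run; the sample numbers below are invented, and the nearest-rank method is just one common way to define a percentile.

```python
def summarize(latencies_sec, wall_clock_sec):
    """Return sustained throughput, mean latency and 95th-percentile latency."""
    ordered = sorted(latencies_sec)
    count = len(ordered)
    # Nearest-rank percentile: the smallest value with at least 95% of
    # samples at or below it. Integer math avoids float rounding surprises.
    rank = (95 * count + 99) // 100
    return {
        "throughput_per_sec": count / wall_clock_sec,
        "mean_sec": sum(ordered) / count,
        "p95_sec": ordered[rank - 1],
    }


# Invented sample: ten requests completed over a 2-second run.
samples = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 5.0]
summary = summarize(samples, wall_clock_sec=2.0)
print(summary)
```

The single slow request barely moves the mean but dominates the 95th percentile, which is why it is worth looking at both.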
I've set up a Couchbase cluster with 2 nodes containing 300k docs in 4 buckets. The replicas option is forced to 1, as there are only 2 machines.
But the documents are split half on one node and half on the other. I need two copies of each document, so that if a node goes down the other one can still supply all the data to my app.
Is there a setting I missed when creating the cluster?
Can I still set the cluster to replicate all documents?
I hope someone can help.
thanks
PS: I'm using couchbase community 4.5
UPDATE:
I've added screenshots of the cluster web interface and the cbstats output:
The following is the state with only one node up:
Next, the state with both nodes up:
Then the cbstats results on both nodes when both are up and running:
As you can see, with only one node only half of the items are displayed. Does that mean the other half resides as replicas but is not shown?
Can I still run my app consistently with only one node?
UPDATE:
I had to click fail-over manually to see the replicas become active on the remaining node, because with just two nodes in the cluster auto fail-over is disabled!
Couchbase Server will partition or shard the documents across the two nodes, as you observed. It will also place replicas on those nodes, based on your one-replica configuration.
To access a replica, you must use one of the Client SDKs.
For example, this Java code will attempt to retrieve a replica (getFromReplica("id", ReplicaMode.ALL)) if the active document retrieval fails (get("id")).
    // Try the active copy first; if that fails, fall back to the replicas.
    bucket.async()
        .get("id")
        .onErrorResumeNext(bucket.async().getFromReplica("id", ReplicaMode.ALL))
        .subscribe();
The ReplicaMode.ALL tells Couchbase to try all nodes with replicas and the active node.
So what was happening with only two nodes in the cluster was that auto fail-over did not kick in, as specified here:
https://developer.couchbase.com/documentation/server/current/clustersetup/automatic-failover.html
This means the replicas were not activated on the remaining node unless fail-over was triggered manually.
The best thing is to have more than two nodes in the cluster before going into production.
To be honest, I should have read the documentation very carefully before asking any questions.
Thanks, Jeff Kurtz, for your help; you pushed me towards the solution (understanding how Couchbase's replica policy works).
I have worked through a number of quota issues in trying to stand up a 30-node click-to-deploy Cassandra cluster. The next issue is that the data disks are not becoming available within the 300 seconds allotted in wait-for-disk.sh.
I've tried several times in us-central1-b, once in us-central1-a and the results range from half of the disks up to 24 of 30. The disks eventually all show up, so no quota issue here, just the timing as far as I can tell.
I've been able to ssh into one node and nearly figure out which steps to run, setting up the required env vars and running the steps in /gagent/. I've gotten the disk mounted and configured and Cassandra started, but the manually repaired node is still missing from the all-important CASSANDRA_NODE_VIEW_NAME, and I must be missing some services because I still can't run cqlsh on it.
It's a bit tedious to set up this way but I could complete the cluster this way manually. Do I need to get it added to the view? How? Or is there a way to specify a longer timeout in wait-for-disk.sh? I'd be willing to wait a pretty long time over doing the remaining setup manually.
We'll look at updating the disk wait value for the next release. Thanks for the feedback! You should be able to join the Cassandra cluster manually after running the install scripts in /gagent. Let me know if you're still having trouble.
So what's the idea behind a cluster?
You have multiple machines with the same copy of the DB, across which you spread the reads/writes? Is this correct?
How does this idea work? When I make a select query, does the cluster analyze which server has fewer reads/writes and point my query to that server?
When should you start using a cluster? I know this is a tricky question, but maybe someone can give me an example, like 1 million visits and a 100-million-row DB.
1) Correct, with one caveat: a data node does not necessarily hold a full copy of the cluster data, but every single bit of data is stored on at least two nodes.
2) Essentially correct. MySQL Cluster supports distributed transactions.
3) When vertical scaling is not possible anymore, and replication becomes impractical :)
As promised, some recommended readings:
Setting Up Multi-Master Circular Replication with MySQL (simple tutorial)
Circular Replication in MySQL (higher-level warnings about conflicts)
MySQL Cluster Multi-Computer How-To (step-by-step tutorial, it assumes multiple physical machines, but you can run your test with all processes running on the same machine by following these instructions)
The MySQL Performance Blog is a reference in this field
1-> Your first point is correct in a way. But I think that if multiple machines shared the same data, it would be replication rather than clustering.
In clustering, the data is divided among the machines using horizontal partitioning, meaning the data is divided by rows: the records are distributed among those machines by some algorithm.
The data is divided in such a way that each record gets a unique key, just as in a key-value pair, and each machine has a unique machine_id that is used to decide which key-value pair goes to which machine.
We call each machine a cluster node; each node consists of its own mysql-server, its own data, and a cluster manager. Data is also shared between all the cluster nodes so that all the data is available to every node at any time.
Data is retrieved through memcached devices/servers for fast retrieval, and there is also a replication server for each cluster to save the data.
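The key-to-machine assignment described above can be sketched as follows. This is only an illustration of hash-based horizontal partitioning, not MySQL Cluster's actual internal algorithm; the function name, the machine IDs and the sample keys are all made up.

```python
import hashlib
from typing import Sequence


def machine_for_key(key: str, machine_ids: Sequence[int]) -> int:
    """Pick the machine that owns a record, based on a hash of its key."""
    # md5 is used only because it gives the same result on every run and
    # every machine (unlike Python's built-in hash() for strings).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % len(machine_ids)
    return machine_ids[slot]


machines = [1, 2, 3, 4]  # hypothetical machine_id values
placement = {key: machine_for_key(key, machines)
             for key in ("user:1", "user:2", "order:17")}
print(placement)
```

The same key always maps to the same machine, so any node can work out where a record lives without asking a central directory.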
2-> Yes, that is possible, because all the data is shared among all the cluster nodes. You can also use a load balancer to balance the load. Load balancers are quite common, since most servers use them. But if you are trying this out just to learn, there is no need for one: you will not see the kind of load that requires a load balancer, and the cluster manager can handle everything itself.
3-> RandomSeed is right. You feel the need for a cluster when replication becomes impractical: if you are using the master server for writes and a slave for reads, then at some point the traffic becomes so large that the server can no longer work smoothly, and you will feel the need for clustering, simply to speed up the whole process.
This is not the only case; it is just one scenario.
Hope this is helpful!