couchbase cbtransfer hangs at 75% while processing 4 million documents - couchbase

I am having a problem running cbtransfer from a node to the filesystem (to back up the existing data). The bucket holds 4387914 documents in total, but the cbtransfer process always stops after writing 3289800 of them to the file. At that point the file size is 1.5 GB and cbtransfer reports its progress as 75%.
I have already tried lowering cbb_max_mb and batch_max_size, setting both to 10, but the freeze happens at the same point.
The Couchbase version is 3.0.1 Community Edition.
Has anyone run into this before? If so, how do I fix it?
Thanks in advance.

Related

MySQL performance difference between local and prod with large text column

I'm trying to figure out what could account for the very large performance difference between my dev environment (5 year old laptop) and our stage server (azure cloud). The table in question is a log table of web service requests for a service that processes XML. One of the columns in the table is the XML passed to the web service.
On my local computer it basically doesn't matter how many rows are in the table; performance is great. On the deployed server, performance starts tanking quickly once there are more than a couple hundred rows. A "select count(*)" on this table with 2000 rows in it takes 0.0017 seconds locally but close to 20 seconds on the server. Even a simple insert of a new row takes a significant amount of time, on the order of whole seconds.
While researching the problem I found this article: explanation of MySQL block performance. That makes sense to me and I'd be happy to implement the 1-to-1 solution, but I don't want to do it until I understand why it's working fine locally and tanking on the server.
Are there some MySQL setting variables I can check to find the differences? I'd really like to get my local computer to have the same performance issue as the deployed so I can validate that the fix will work.
EDIT:
The create table queries are identical. MySQL versions are 5.7.23 and 5.7.22. I did notice that the buffer pool is 16x bigger on my local machine. Gonna try to get the server updated to the setting my local has and see if that resolves the issue.
The solution was updating the buffer pool size like Rick suggested.
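To answer the "which settings differ" part of the question, a minimal Python sketch follows. It assumes the output of SHOW VARIABLES from each server has already been loaded into a dict of name to value; the sample values are illustrative only (chosen so that the buffer pool differs by 16x) and are not taken from the original servers.

```python
def diff_variables(local, server):
    """Return {name: (local_value, server_value)} for settings that differ."""
    names = sorted(set(local) | set(server))
    return {
        name: (local.get(name), server.get(name))
        for name in names
        if local.get(name) != server.get(name)
    }


# Illustrative values only, not the real settings from the question.
local_vars = {"innodb_buffer_pool_size": "2147483648", "max_connections": "151"}
server_vars = {"innodb_buffer_pool_size": "134217728", "max_connections": "151"}

for name, (mine, theirs) in diff_variables(local_vars, server_vars).items():
    print(f"{name}: local={mine} server={theirs}")
```

A setting that exists on only one side shows up with None for the other, which helps when the two MySQL versions differ slightly.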

Time to first byte 6secs Openshift

I am using this template (https://github.com/openshift-evangelists/php-quickstart) on a start node west 2 on Openshift. I assigned 256MB on the php container and 256MB on the MySQL container.
I have no data on MySQL and with really bare bone php scripts the time to first byte (TTFB) is 6 seconds. I don't get any delays to other websites like this and definitely not on my old Openshift 2 installation.
Is this normal? Is Openshift 3 slower like that for the free (starter) services? Or is there something I am doing wrong? Any way I can troubleshoot this further?
256MB is too little for MySQL. From what I have seen it usually wants more than that, which is why the default was set to 512MB. The exception would be if it dynamically works out how much memory is available and tries to gobble up as much of it as possible.
The slow responses are a known issue that has been affecting a number of the Online Starter environments on and off. The issue is still being investigated and a solution is being implemented.
You can verify actual response times by getting inside the container using oc rsh or the web console and using curl against $HOSTNAME:8080.
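If curl is not convenient, the same check can be sketched in Python. This is a rough illustration, not part of the original answer: the function name time_to_first_byte is made up here, and it assumes a Python interpreter is available wherever you run it. It connects first so that TCP setup is not counted, then times the gap between sending the request and reading the first byte of the response body.

```python
import http.client
import time


def time_to_first_byte(host, port, path="/", timeout=30.0):
    """Return seconds from sending a GET request to reading the first body byte."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.connect()  # connect up front so TCP setup is excluded from the timing
        start = time.perf_counter()
        conn.request("GET", path)
        response = conn.getresponse()  # returns once the status line and headers arrive
        response.read(1)               # first byte of the body
        return time.perf_counter() - start
    finally:
        conn.close()
```

Inside the container this could be called as time_to_first_byte(os.environ["HOSTNAME"], 8080) after importing os. Comparing that figure with the one measured from outside separates application time from routing delay.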

Creating a large click-to-deploy cassandra cluster

I have worked through a number of quota issues in trying to stand up a 30 node click-to-deploy cassandra cluster. The next issue is that the data disks are not becoming available within the 300 seconds allotted in wait-for-disk.sh.
I've tried several times in us-central1-b, once in us-central1-a and the results range from half of the disks up to 24 of 30. The disks eventually all show up, so no quota issue here, just the timing as far as I can tell.
I've been able to ssh into one node and nearly figure out which steps to run, setting up required env vars and running the steps in /gagent/. I've gotten the disk mounted and configured and get cassandra started but the manually repaired node is still missing from the all-important CASSANDRA_NODE_VIEW_NAME and I must be missing some services because I still can't run cqlsh on the manually repaired node.
It's a bit tedious to set up this way but I could complete the cluster this way manually. Do I need to get it added to the view? How? Or is there a way to specify a longer timeout in wait-for-disk.sh? I'd be willing to wait a pretty long time over doing the remaining setup manually.
We'll look at updating the disk wait value for the next release. Thanks for the feedback! You should be able to join the Cassandra cluster manually after running the install scripts in /gagent. Let me know if you're still having trouble.

Couchbase 3.0 rebalance stuck

I had a 4 node Couchbase cluster with 3 buckets, each having 1 replica. However, when one of my nodes went down, part of my dataset became inaccessible. I thought this might be because I have an even number of nodes, i.e. 4 (instead of, say, 3 or 5), so I failed over 1 node. I then proceeded to rebalance the cluster, at which point it got stuck. The only thing I can find in the logs is Bucket "staging" rebalance does not seem to be swap rebalance. Any idea how to recover from this?
In my desperation I also tried changing the number of replicas on different buckets and then performing a rebalance. Nothing worked. This has happened once before as well; that time I had to dump my whole database out and load it into a brand new cluster because I couldn't even back up my database. This time that path is not an option, since the data is critical and uptime is also important.
Couchbase support pointed to a bug where rebalancing can hang if there are empty vbuckets. According to them this was fixed in 2.0, but it is not.
The workaround is to populate the buckets with a minimum of 2048 items that have a short time to live (TTL >= (10 minutes per upgrade + (2 x rebalance_time)) x num_nodes), so that all vbuckets have something in them. We populated all buckets this way and were then able to restart the rebalance process, which completed fine.
This works for Couchbase 3.0.
Reference: http://www.pythian.com/blog/couchbase-rebalance-freeze-issue/
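The TTL bound in the workaround above can be worked through with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up, and the 15 minute rebalance time in the example is an assumed value, not a measurement from this cluster.

```python
def min_ttl_seconds(rebalance_time_s, num_nodes, upgrade_time_s=10 * 60):
    """Lower bound on item TTL: (upgrade time + 2 x rebalance time) x number of nodes."""
    return (upgrade_time_s + 2 * rebalance_time_s) * num_nodes


# Assumed example: a rebalance that takes 15 minutes on a 4 node cluster.
ttl = min_ttl_seconds(rebalance_time_s=15 * 60, num_nodes=4)
print(ttl, "seconds =", ttl // 60, "minutes")
```

With those assumed inputs the filler items need a TTL of at least 9600 seconds (160 minutes).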

How to find out what is causing a slow down of the application?

This is not the typical question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.
Situation
We have this web application that uses Zend Framework, so runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
The application has a very unusual usage and load pattern. It is a mobile web application where, every full hour, a cronjob looks through the database for users that have some information waiting or an action to do and sends this information to an (external) notification server, which pushes these notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.
Problem
In the last few weeks usage of the application really started to grow. In the last few days we encountered very high load and doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests, it just gets slower and slower and often takes 20 minutes to recover - until the same thing starts again at the full hour.
We have extensive monitoring in place (New Relic, collectd) but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16 core Intel Xeon (8 cores with hyperthreading, I think) and 12GB RAM running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is Version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs (linked images, not reproduced here)
info: phpinfo(), apc status, memcache status
collectd: Processes, CPU, Apache, Load, MySQL, Vmem, Disk
New Relic: Application performance, Server overview, Processes, Network, Disks
(Sorry the graphs are gifs and not the same time period, but I think the most important info is in there)
The problem is almost certainly MySQL based. If you look at the final graph, mysql/mysql_threads, you can see the number of threads hits 200 (which I assume is your setting for max_connections) at 20:00. Once max_connections has been hit, things do tend to take a while to recover.
Using mtop to monitor MySQL just before the hour will really help you figure out what is going on, but if you cannot install it you could just use SHOW PROCESSLIST;. You will need to establish your connection to MySQL before the problem hits. You will probably see lots of processes queued with only 1 process currently executing. That one is the most likely culprit.
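When the process list is long, a small script makes the "many queued, one running" pattern easier to spot. The Python sketch below assumes the rows from SHOW PROCESSLIST have already been fetched as dicts keyed by the standard column names (Id, Command, Time, State, Info). The function name and the sample rows are hypothetical and are not output from the server in the question.

```python
from collections import Counter


def summarize_processlist(rows):
    """Count threads by (Command, State) and find the longest-running non-sleeping one."""
    counts = Counter((row["Command"], row["State"] or "") for row in rows)
    active = [row for row in rows if row["Command"] != "Sleep"]
    longest = max(active, key=lambda row: row["Time"], default=None)
    return counts, longest


# Hypothetical sample rows for illustration.
sample_rows = [
    {"Id": 1, "Command": "Query", "Time": 42, "State": "Sending data", "Info": "SELECT /* hypothetical */ 1"},
    {"Id": 2, "Command": "Query", "Time": 40, "State": "Waiting for table level lock", "Info": "UPDATE /* hypothetical */"},
    {"Id": 3, "Command": "Query", "Time": 39, "State": "Waiting for table level lock", "Info": "UPDATE /* hypothetical */"},
    {"Id": 4, "Command": "Sleep", "Time": 5, "State": "", "Info": None},
]

counts, longest = summarize_processlist(sample_rows)
for (command, state), n in counts.most_common():
    print(n, command, state)
print("longest-running:", longest["Id"], longest["Info"])
```

A large count in a waiting state next to a single long-running query suggests that query is the one holding everything else up.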
Having identified the query causing the problems you can attack your code. Without understanding how your application is actually working my best guess would be that using an explicit transaction around the problem query(ies) will probably solve the problem.
Good luck!