Creating a large click-to-deploy Cassandra cluster - google-compute-engine

I have worked through a number of quota issues in trying to stand up a 30-node click-to-deploy Cassandra cluster. The next issue is that the data disks are not becoming available within the 300 seconds allotted in wait-for-disk.sh.
I've tried several times in us-central1-b and once in us-central1-a, and the results range from half of the disks coming up in time to 24 of 30. The disks eventually all show up, so there's no quota issue here, just the timing as far as I can tell.
I've been able to ssh into one node and mostly work out which steps to run, setting up the required environment variables and running the steps in /gagent/. I've gotten the disk mounted and configured and Cassandra started, but the manually repaired node is still missing from the all-important CASSANDRA_NODE_VIEW_NAME, and I must be missing some services because I still can't run cqlsh on that node.
It's a bit tedious to set up this way, but I could complete the cluster manually like this. Do I need to get the node added to the view? How? Or is there a way to specify a longer timeout in wait-for-disk.sh? I'd be willing to wait quite a long time rather than do the remaining setup by hand.

We'll look at updating the disk wait value for the next release. Thanks for the feedback! You should be able to join the Cassandra cluster manually after running the install scripts in /gagent. Let me know if you're still having trouble.
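In the meantime, for anyone who wants to experiment on their own instances, the sketch below shows roughly what a disk-wait loop with a configurable timeout could look like. This is only an illustration, not the actual wait-for-disk.sh from click-to-deploy; the device path and the WAIT_TIMEOUT_SECONDS variable are made-up names.

    #!/bin/bash
    # Hypothetical disk-wait loop; the real click-to-deploy script may differ.
    # DISK_PATH and WAIT_TIMEOUT_SECONDS are illustrative names only.
    DISK_PATH="${1:-/dev/disk/by-id/google-cassandra-data}"
    WAIT_TIMEOUT_SECONDS="${WAIT_TIMEOUT_SECONDS:-900}"   # e.g. raise from 300 to 900

    elapsed=0
    until [[ -b "${DISK_PATH}" ]]; do
      if (( elapsed >= WAIT_TIMEOUT_SECONDS )); then
        echo "Timed out after ${WAIT_TIMEOUT_SECONDS}s waiting for ${DISK_PATH}" >&2
        exit 1
      fi
      sleep 5
      (( elapsed += 5 ))
    done
    echo "${DISK_PATH} became available after ${elapsed}s"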

Related

Database connections climbing and RDS CPU hitting 100% during load testing

When load testing my application, the AWS RDS CPU hits 100% and the corresponding requests error out. The RDS instance is an m4.2xlarge. With the same configuration everything was fine until two weeks ago; no infrastructure changes were made to the environment, nor any application-level changes. The whole two-hour load test used to run smoothly until two weeks ago. There are no specific exceptions apart from GenericJDBCException.
All other necessary services are up and running on their respective instances.
We are using SQL as the database management system.
Is there any reason this would start happening suddenly? How do we resolve it? Suggestions are much appreciated; this has created many problems.
Monitoring the slow logs and resolving them did not solve the problem.
Should we upgrade the RDS instance to the next version?
Does more data in the DB slow the database down?
We have also modified the connection pool parameters and tried that.
With "load testing", are you able to finish one day's work in one hour? That sounds great! Or what do you mean by "load testing"?
Or are you trying to launch 200 threads in one second and they are stumbling over each other? That's to be expected. Do you really get 200 new connections in a single second? Or is it spread out?
1 million queries per day is no problem. A million queries all at once will fail.
Do not let your "load test" launch more threads than you can reasonably expect. They will all pile up, and latency will suffer while the server is giving each thread an equal chance.
Meanwhile, use the slowlog to find the "worst" queries in production. Then let's discuss the worst one or two -- Often an improved index makes that query work much faster, thereby no longer contributing to the train wreck.
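For the slow-log step, something along these lines is one way to start. It is only a sketch: on RDS you would normally turn the slow query log on through a parameter group and fetch it via the console or AWS CLI rather than from a local file, and the hostname, threshold, and log path below are just examples.

    # Turn on the slow query log and pick a threshold (illustrative values).
    mysql -h mydb.example.com -u admin -p -e "
      SET GLOBAL slow_query_log = 'ON';
      SET GLOBAL long_query_time = 1;"

    # After a load-test run, summarize the worst offenders by total time.
    # mysqldumpslow ships with the MySQL client tools; the path is an example.
    mysqldumpslow -s t -t 10 /var/lib/mysql/slow.log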

Time to first byte of 6 seconds on OpenShift

I am using this template (https://github.com/openshift-evangelists/php-quickstart) on a Starter node (US West 2) on OpenShift. I assigned 256MB to the PHP container and 256MB to the MySQL container.
I have no data in MySQL, and with really bare-bones PHP scripts the time to first byte (TTFB) is 6 seconds. I don't see delays like this on other websites, and definitely not on my old OpenShift 2 installation.
Is this normal? Is OpenShift 3 this slow for the free (Starter) tier? Or is there something I am doing wrong? Any way I can troubleshoot this further?
256MB is too little for MySQL; it usually wants to use more than that from what I have seen, which is why the default was set to 512MB. Unless, that is, it dynamically works out how much memory it has available and tries to grab as much as possible.
The slow-response behaviour is a known issue which has been affecting a number of the Online Starter environments on and off. The issue is still being investigated and a solution worked on.
You can verify actual response times by getting inside the container using oc rsh or the web console and using curl against $HOSTNAME:8080.
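For example, a timing breakdown like the one below, run from inside the pod, will show whether the 6 seconds is spent in the container itself or in the routing layer in front of it (the port assumes the quickstart's default of 8080):

    # Run inside the pod after `oc rsh <pod-name>`.
    curl -o /dev/null -s \
      -w 'dns: %{time_namelookup}  connect: %{time_connect}  ttfb: %{time_starttransfer}  total: %{time_total}\n' \
      "http://${HOSTNAME}:8080/"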

Couchbase 3.0 rebalance stuck

I had a 4-node Couchbase cluster with 3 buckets, each having 1 replica. However, when one of my nodes went down, part of my dataset became inaccessible. I thought this might be because I have an even number of nodes, i.e. 4 (instead of, say, 3 or 5), so I failed over 1 node. I then proceeded to rebalance the cluster, at which point it got stuck. The only thing I can find in the logs is Bucket "staging" rebalance does not seem to be swap rebalance. Any idea how to recover from this?
In my desperation I also tried changing the replica counts of different buckets and then performing a rebalance. Nothing worked. This has happened once before as well; that time I had to dump my whole database out and load it into a brand new cluster because I couldn't even back up my database. This time that path is not an option, since the data is critical and uptime is also important.
Couchbase support pointed to a bug where rebalancing can hang if there are empty vBuckets. According to them this was fixed in 2.0, but it is not.
The workaround is to populate the buckets with a minimum of 2048 short-TTL items (TTL >= (10 minutes per upgrade + (2 x rebalance_time)) x num_nodes) so that all vBuckets have something in them. We populated all buckets successfully and were able to restart the rebalance process, which completed fine.
This works for Couchbase 3.0.
Reference: http://www.pythian.com/blog/couchbase-rebalance-freeze-issue/
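As a rough illustration of the workaround, a loop like the one below could seed each bucket with short-TTL items. It assumes libcouchbase's cbc command-line tool is installed and the cluster is reachable on localhost; the bucket names other than staging, the TTL value, and the exact flag spellings are examples, so check cbc create --help for your libcouchbase version (and add bucket credentials if yours require them).

    # Write 2048 short-TTL items per bucket so every vBucket holds at least one key.
    TTL_SECONDS=3600    # choose per the TTL formula above
    for bucket in staging bucket2 bucket3; do    # replace with your bucket names
      for i in $(seq 1 2048); do
        cbc create "warmup-key-${i}" -V "x" --expiry "${TTL_SECONDS}" \
          -U "couchbase://127.0.0.1/${bucket}"
      done
    done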

Lots of "Query End" states in MySQL, all connections used in a matter of minutes

This morning I noticed that our MySQL server load was going sky high. Max should be 8 but it hit over 100 at one point. When I checked the process list I found loads of update queries (simple ones, incrementing a "hitcounter") that were in query end state. We couldn't kill them (well, we could, but they remained in the killed state indefinitely) and our site ground to a halt.
We had loads of problems restarting the service and had to forcibly kill some processes. When we did we were able to get MySQLd to come back up but the processes started to build up again immediately. As far as we're aware, no configuration had been changed at this point.
So, we changed innodb_flush_log_at_trx_commit from 2 to 1 (note that we need ACID compliance) in the hope that this would resolve the problem, and set the connections in PHP/PDO to be persistent. This seemed to work for an hour or so, and then the connections started to run out again.
Fortunately, I set a slave server up a couple of months ago and was able to promote it, and it's taking up the slack for now. But I need to understand why this happened and how to stop it, because the slave is significantly underpowered compared to the master and I need to switch back soon.
Has anyone any ideas? Could it be that something needs clearing out? I don't know what, maybe the binary logs or something? Any ideas at all? It's extremely important that we can get this server back as the master ASAP but frankly I have no idea where to look and everything I have tried so far has only resulted in a temporary fix.
Help! :)
I'll answer my own question here. I checked the partition sizes with a simple df command and saw that /var was 100% full. I found a 10GB archive that someone had left there. I deleted that, started MySQL, ran PURGE BINARY LOGS BEFORE '2012-10-01 00:00:00' to clear out a load of space, and reduced the /var/lib/mysql directory size from 346GB to 169GB. Switched back to the master and everything is running great again.
From this I've learnt that our log files get VERY large, VERY quickly. So I'll be establishing a maintenance routine to not only keep the log files down, but also to alert me when we're nearing a full partition.
I hope that's some use to someone in the future who stumbles across this with the same problem. Check your drive space! :)
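A minimal version of that maintenance routine might look like the sketch below, assuming a MySQL version where expire_logs_days is still available; the retention period, threshold, mail address, and cron schedule are placeholders.

    # Cap binary log retention so /var/lib/mysql cannot grow without bound
    # (also put expire_logs_days in my.cnf so it survives restarts).
    mysql -u root -p -e "SET GLOBAL expire_logs_days = 7;"

    # Cron-able disk check, e.g. */15 * * * * /usr/local/bin/check_var_space.sh
    THRESHOLD=85
    USED=$(df -P /var | awk 'NR==2 { gsub("%", "", $5); print $5 }')
    if [ "${USED}" -ge "${THRESHOLD}" ]; then
      echo "/var is ${USED}% full on $(hostname)" | mail -s "Disk space warning" ops@example.com
    fi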
We've been having a very similar problem, where the mysql processlist showed that almost all of our connections were stuck in the "query end" state. Our problem was also related to replication and writing the binlog.
We changed the sync_binlog variable from 1 to 0, which means that instead of flushing binlog changes to disk on each commit, it allows the operating system to decide when to fsync() to the binlog. That entirely resolved the "query end" problem for us.
According to this post from Mats Kindahl, writing to the binlog won't be as much of a problem in the 5.6 release of MySQL.
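For reference, the change amounts to something like the snippet below; the trade-off is that the operating system decides when to flush, so a crash can lose the last few binlog events. Weigh that against your replication and durability requirements.

    # Apply at runtime...
    mysql -u root -p -e "SET GLOBAL sync_binlog = 0;"

    # ...and make it persistent by adding this to the [mysqld] section of my.cnf
    # (file location varies by distribution):
    #   sync_binlog = 0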
In my case, it was indicative of maxing out the I/O on disk. I had already reduced fsyncs to a minimum, so it wasn't that. Another symptom is that "log*.tokulog*" files start accumulating because the system can't keep up with all the writes.
I ran into this problem in a production use case. I was using REPLACE INTO ... SELECT for archival on a table with about 10 lakh (1 million) records in it. I interrupted the query in the middle and it went to the killed state.
I had set innodb_flush_log_at_trx_commit to 1 as well as sync_binlog = 1, and it completely recovered in my case. I then triggered my archival script again without any issues.

How to find out what is causing a slow down of the application?

This is not the typical question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.
Situation
We have a web application that uses Zend Framework, so it runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
The application has a very unusual usage and load pattern. It is a mobile web application where, every full hour, a cronjob looks through the database for users that have information waiting or an action to take, and sends this information to an (external) notification server, which pushes the notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.
Problem
In the last few weeks usage of the application really started to grow. In the last few days we encountered very high load and doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests, it just gets slower and slower and often takes 20 minutes to recover - until the same thing starts again at the full hour.
We have extensive monitoring in place (New Relic, collectd), but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16 core Intel Xeon (8 cores with hyperthreading, I think) and 12GB RAM running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is Version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs
Info: phpinfo(), APC status, memcache status
collectd: processes, CPU, Apache, load, MySQL, vmem, disk
New Relic: application performance, server overview, processes, network, disks
(Sorry the graphs are GIFs and not all from the same time period, but I think the most important info is in there.)
The problem is almost certainly MySQL based. If you look at the final graph mysql/mysql_threads you can see the number of threads hits 200 (which I assume is your setting for max_connections) at 20:00. Once the max_connections has been hit things do tend to take a while to recover.
Using mtop to monitor MySQL just before the hour will really help you figure out what is going on, but if you cannot install it you can just use SHOW PROCESSLIST; (see the sketch below). You will need to establish your connection to MySQL before the problem hits. You will probably see lots of processes queued with only 1 process currently executing. This will be the most likely culprit.
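A crude capture loop like this, started a minute or two before the top of the hour, gives much the same picture if mtop isn't an option; credentials and the output path are placeholders:

    # Snapshot the full process list every 5 seconds for roughly 5 minutes.
    OUT="/var/log/mysql-processlist-$(date +%Y%m%d-%H%M).log"
    for i in $(seq 1 60); do
      {
        date
        mysql -u root -p'secret' -e "SHOW FULL PROCESSLIST;"
        echo "----"
      } >> "${OUT}"
      sleep 5
    done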
Having identified the query causing the problems you can attack your code. Without understanding how your application is actually working my best guess would be that using an explicit transaction around the problem query(ies) will probably solve the problem.
Good luck!
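As a purely hypothetical illustration of the explicit-transaction idea: if the hourly cronjob issues many small writes, committing each batch as one transaction avoids paying the commit overhead per statement. The database, table, and column names below are invented.

    # Group the cronjob's writes into one transaction instead of auto-committing each.
    mysql -u app -p'secret' app_db -e "
      START TRANSACTION;
      UPDATE notifications SET sent_at = NOW() WHERE user_id = 101;
      UPDATE notifications SET sent_at = NOW() WHERE user_id = 102;
      COMMIT;"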