Google VM Instance becomes unhealthy on its own - google-compute-engine

I have been using Google Cloud for quite some time and everything has worked fine. I was using a single VM instance to host both the website and the MySQL database.
Recently, I decided to move the website to autoscaling so that it doesn't go down on days when traffic increases.
So I moved the database to Cloud SQL and created a VM group which hosts the PHP, HTML, and image files. Then I set up a load balancer to distribute traffic across the VM instances in the group.
The problem is that the backend service (the VM group behind the load balancer) becomes unhealthy on its own after working fine for 5-6 hours and then becomes healthy again after 10-15 minutes. I have also seen the problem appear when I run a fairly long script with many MySQL queries.
I checked the health check and it was returning a 200 response. During the 10-15 minute down period, the VM instance is still accessible from its own IP address.
Everything else is the same; I have just added a load balancer in front of the VM instance, and the problem has started.
Can anybody help me troubleshoot this problem?

It sounds like your server is timing out (blocking?) on the health check during the times the load balancer reports it as down. A few things you can check:
The Apache logs (I'm presuming you're using Apache?) can include the request duration along with the response status. The default health check timeout is 5 s, so if your health check endpoint takes 6 s to return a 200, the health checker will give up after 5 s and treat the host as down.
You mention that a heavy MySQL load can trigger the problem. Have you looked at disk I/O statistics and CPU to make sure this isn't a load-related problem? If it is CPU- or I/O-bound, you might look at increasing CPU or disk size, or moving your disk from spindle-backed to SSD-backed storage.
Have you checked that you have sufficient worker threads available? Ideally, your health check would run quickly, but it might be delayed if, for example, you have 3 workers and all three are busy running some other PHP script that's waiting on the database.
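If it helps, here is a rough shell sketch of those three checks. The health-check path /health, the log path, and the health check name my-health-check are placeholders for whatever your setup actually uses, and the gcloud flags are worth confirming against gcloud compute http-health-checks update --help:

    # 1. Log request durations in Apache by adding %D (microseconds) to your
    #    LogFormat, then watch how long the health-check requests take:
    #      LogFormat "%h %l %u %t \"%r\" %>s %b %D" combined_with_duration
    grep "GET /health" /var/log/apache2/access.log | tail -n 50

    # 2. Check CPU and disk I/O around the time the backend goes unhealthy
    #    (iostat is part of the sysstat package):
    top -b -n 1 | head -n 20
    iostat -xz 5 3

    # 3. If the endpoint is legitimately slow, relax the health check itself:
    gcloud compute http-health-checks update my-health-check \
        --timeout 10 --unhealthy-threshold 3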

Related

MySQL not behaving asynchronously with Amazon RDS instance?

I'm encountering a very strange issue with our MySQL RDS deployment. When a complex stored procedure that can take 10+ seconds to complete is called, all other calls to the database get bogged down and hang, including any call to SHOW FULL PROCESSLIST. Note that the calls come from other, external sessions: the stored procedures that take 10-20 seconds are called by our web service, while my attempts to execute queries or SHOW FULL PROCESSLIST come from the IDE on my machine, so a completely different connection/session.
Yet my query hangs until the other process is complete, and Amazon RDS reports just 2.3% CPU usage for MySQL.
Heck, even opening the connection to RDS while these stored procedures are running takes forever, so something is very wrong - it's as if MySQL isn't operating in any asynchronous capacity.
Any ideas what's going on here? Am I missing a single simple default flag in RDS that's turned off asynchronous processing?
The issue was the class of instance we were using with AWS; it was just too small. Once we updated it to t2.medium, the problems disappeared. The unusual thing is that what we were running was really nothing intensive for the database; it appears the t2.micro class is simply not designed to be used in any real capacity. One of the issues is that the price starts compounding very quickly in AWS, even for a sandbox system. A small company can quickly find fees in excess of $1,000 just by running test environments, which is not reasonable given the service and performance level AWS provides for the cost.
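For what it's worth, the t2 classes are burstable, so one thing to check in a situation like this is whether the instance had simply exhausted its CPU credits and was being throttled. A minimal sketch using the AWS CLI, assuming it is configured, with a placeholder instance identifier and GNU date:

    # CPUCreditBalance dropping toward 0 during the slow periods would point at throttling.
    aws cloudwatch get-metric-statistics \
        --namespace AWS/RDS \
        --metric-name CPUCreditBalance \
        --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
        --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
        --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        --period 300 \
        --statistics Minimum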

Why is an idle Cloud SQL instance showing 10 QPS of write requests?

I have connected an App Maker app to a 2nd generation MySQL instance on GCP.
All seems to work fine, but I noticed that the Cloud Console believes this instance sees 10 write ops per second at all times, even when nothing should be running.
The SQL logs seem to say that there are no requests. Billing does not look off, so I'm wondering whether I'm seeing something like prober requests, although 10 QPS is a bit high for that, and I would expect to see something in the logs.
Any insights would be very much appreciated.
Update: it looks like any GCP Cloud SQL MySQL instance has a heartbeat write every 2 seconds, or every second if automatic backups are enabled.
These heartbeats seem pretty cheap in terms of CPU utilization, but they do seem to make storage grow slowly over time.
I'm still interested to know whether the heartbeat frequency can be tuned lower (for non-replicated setups; the replication heartbeat frequency can be tuned).
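If you want to double-check what the console graph is counting, one rough approach is to sample MySQL's own write-statement counters while the app is idle; the host and credentials below are placeholders:

    # Sample the write counters a minute apart and diff them; the per-second
    # write rate is (second sample - first sample) / 60 for each counter.
    Q="SHOW GLOBAL STATUS WHERE Variable_name IN ('Com_insert','Com_update','Com_delete');"
    mysql -h 10.0.0.3 -u root -p -e "$Q"
    sleep 60
    mysql -h 10.0.0.3 -u root -p -e "$Q"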

Railo websites can't connect during MySQL backup

I have a weekly backup that runs for one of my MySQL databases for one of my websites (ccms). The backup is about 1.2 GB and takes about 30 minutes to run.
When this database backup runs, all my other Railo websites cannot connect and go "down" for the duration of the backup.
One of the errors I have managed to catch was:
"[show] railo.runtime.exp.RequestTimeoutException: request (:119) is run into a
timeout (1200 seconds) and has been stopped. open locks at this time (c:/railo/webapps/root/ccms/parsed/photo.view.cfm,
c:/railo/webapps/root/ccms/parsed/profile.view.cfm, c:/railo/webapps/root/ccms/parsed/album.view.cfm,
c:/railo/webapps/root/ccms/parsed/public.dologin.cfm)."
What I believe is happening is that the tables required for those pages (the "ccms" website) are being locked due to the backup, which is fair enough.
But why is that causing the other Railo websites to time out? For example, the error I pasted above was actually taken from a different website, not the "ccms" website it references. Any website I try to run fails and throws an error referencing the "ccms" website, which is the one being backed up. How do I avoid this?
Any insight would be greatly appreciated.
Thanks
One possibility, given that your timeout appears to be 20 minutes: each time a request comes in to the site that IS being backed up, that thread blocks waiting on the DB.
Railo has a pool of worker threads to handle requests, and now one of them is tied up. As requests continue to come in, each request to the affected site ties up another thread. Eventually there are no more workers in the pool, and all subsequent requests, for any site, are queued up to be processed once workers become available.
I'm not an expert on debugging Railo, but the above seems plausible to me. You could consider running separate Railo processes for different sites, which would isolate them, or drastically lowering your DB timeout (if acceptable).
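Another angle, if the ccms tables are InnoDB, is to take the backup in a way that doesn't hold table locks for the full 30 minutes: mysqldump's --single-transaction mode takes a consistent snapshot instead (it does not help for MyISAM tables). A sketch, with the database name and credentials as placeholders:

    # Consistent InnoDB snapshot without long-held table locks; --quick streams
    # rows instead of buffering whole tables in memory.
    mysqldump --single-transaction --quick -u backup_user -p ccms \
        > /backups/ccms-$(date +%F).sql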

How to find out what is causing a slowdown of the application?

This is not the typical question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.
Situation
We have this web application that uses Zend Framework, so runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
The application has a very distinctive usage and load pattern. It is a mobile web application where, every full hour, a cronjob looks through the database for users that have some information waiting or an action to take and sends this information to an (external) notification server, which pushes the notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.
Problem
In the last few weeks, usage of the application has really started to grow. In the last few days we have encountered very high load and a doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests; it just gets slower and slower and often takes 20 minutes to recover, until the same thing starts again at the full hour.
We have extensive monitoring in place (New Relic, collectd), but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16 core Intel Xeon (8 cores with hyperthreading, I think) and 12GB RAM running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is Version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs (links to screenshots): phpinfo(), APC status, and memcache status; collectd graphs for processes, CPU, Apache, load, MySQL, virtual memory, and disk; New Relic graphs for application performance, server overview, processes, network, and disks.
(Sorry the graphs are GIFs and not from the same time period, but I think the most important info is in there.)
The problem is almost certainly MySQL-based. If you look at the final graph, mysql/mysql_threads, you can see the number of threads hits 200 (which I assume is your setting for max_connections) at 20:00. Once max_connections has been hit, things do tend to take a while to recover.
Using mtop to monitor MySQL just before the hour will really help you figure out what is going on, but if you cannot install it you could just use SHOW PROCESSLIST;. You will need to establish your connection to MySQL before the problem hits. You will probably see lots of processes queued up with only one process currently executing; that will most likely be the culprit.
Having identified the query causing the problems, you can attack your code. Without understanding how your application actually works, my best guess is that wrapping the problem query (or queries) in an explicit transaction will probably solve the problem.
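A quick way to gather those numbers from the shell just before the top of the hour (credentials are placeholders):

    # How many connections you allow and how close you get to the limit
    mysql -u root -p -e "SHOW VARIABLES LIKE 'max_connections';
                         SHOW GLOBAL STATUS LIKE 'Threads_connected';
                         SHOW GLOBAL STATUS LIKE 'Max_used_connections';"

    # What those connections are actually doing when things slow down
    mysql -u root -p -e "SHOW FULL PROCESSLIST;"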
Good luck!

Service deployed on Tomcat crashing under heavy load

I'm having trouble with a web service deployed on Tomcat. During peak traffic times the server becomes unresponsive and forces me to restart the entire server in order to get it working again.
First of all, I'm pretty new to all this. I built the server myself using various guides and blogs. Everything has been working great, but due to the larger load of traffic, I'm now getting out of my league a little. So, I need clear instructions on what to do or to be pointed towards exactly what I need to read up on.
I'm currently monitoring the service using JavaMelody, so I can see the spikes occurring, but I am unaware how to get more detailed information than this as to possible causes/solutions.
The server itself is quad-core with 16 GB of RAM, so the issue doesn't lie there; more likely it's that I need to configure Tomcat properly to be able to use this (or set up a cluster...?).
JavaMelody shows the service crashing when CPU usage only gets to about 20% and about 300 hits a minute. Are there any max-connection limits or memory settings that I should be configuring?
I also only have a single instance of the service deployed. I understand I can simply rename the WAR file and Tomcat will deploy a second instance. Will doing this help?
Each request also opens (and immediately closes) a connection to MySQL to retrieve data, so I probably need to make sure it's not getting throttled there either.
Sorry this is so long winded and has multiple questions. I can give more information as needed, I am just not certain what needs to be given at this time!
The server has 16 GB of RAM, but how much memory do you have dedicated to Tomcat via -Xms and -Xmx?
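If you haven't set these, here is a minimal sketch; the install path and the heap sizes are examples, not recommendations, so tune them to your app and leave room for the OS:

    # catalina.sh sources bin/setenv.sh on startup if it exists;
    # create it with the heap flags (path and sizes are placeholders):
    echo 'export CATALINA_OPTS="$CATALINA_OPTS -Xms2g -Xmx8g"' > /opt/tomcat/bin/setenv.sh
    chmod +x /opt/tomcat/bin/setenv.sh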