Why is an idle Cloud SQL instance showing 10 QPS of write requests? - mysql

I have connected an App Maker app to a 2nd generation MySQL instance on GCP.
Everything seems to work fine, but I noticed that the Cloud Console reports 10 write ops per second on this instance at all times, even when nothing should be running.
The SQL logs seem to say that there are no requests. Billing does not look off, so I'm just wondering whether I'm seeing something like prober requests, although 10 QPS is a bit high for that, and I would expect to see something in the logs.
Any insights would be very much appreciated.
Update: Looks like any GCP MySQL instance has a heartbeat every 2 seconds, or every second if automatic backups are enabled.
These heartbeats seem pretty cheap in terms of CPU utilization, but they seem to make storage grow slowly over time.
I'm still interested to know if the heartbeat frequency can be tuned lower (for non-replicated setups; replication heartbeat frequency can be tuned.)
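For a rough sense of how those heartbeats add up, here is a back-of-the-envelope sketch in Python. It uses only the two intervals quoted above (2 seconds, or 1 second with automatic backups). The bytes-per-heartbeat figure is a made-up placeholder, not something measured on Cloud SQL, so replace it with whatever you observe on your own instance.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def heartbeats_per_day(interval_s: float) -> int:
    """Number of heartbeat writes in one day for a given interval in seconds."""
    return int(SECONDS_PER_DAY / interval_s)


def daily_growth_mb(interval_s: float, bytes_per_heartbeat: int) -> float:
    """Storage added per day, assuming every heartbeat appends a fixed
    number of bytes (hypothetical value, measure it yourself)."""
    return heartbeats_per_day(interval_s) * bytes_per_heartbeat / 1_000_000


if __name__ == "__main__":
    ASSUMED_BYTES = 200  # placeholder, not a measured Cloud SQL figure
    for interval in (2, 1):
        print(
            f"every {interval}s: {heartbeats_per_day(interval)} writes/day, "
            f"~{daily_growth_mb(interval, ASSUMED_BYTES):.2f} MB/day"
        )
```

Even a small per-write footprint becomes visible over weeks at these rates, which would be consistent with the slow storage growth.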

Related

Database connections are climbing and the RDS CPU is hitting 100% during load testing

When load testing my application, the AWS RDS CPU hits 100% and the corresponding requests error out. The RDS instance is an m4.2xlarge. With the same configuration, things were fine until 2 weeks ago. No infrastructure changes have been made to the environment, nor any application-level changes. Until 2 weeks ago, the whole load test used to run smoothly for the full 2 hours. There are no specific exceptions apart from GENERICJDBCEXCEPTION.
All other necessary services are up and running on respective instances.
We are using SQL as Database Management System.
Is there any chance that this can happen suddenly? How can we resolve it? Suggestions are much appreciated. This has created many problems.
Monitoring the slow logs and fixing the queries found there did not solve the problem.
Should we upgrade the RDS to the next version?
Does having more data in the DB slow the database down?
We have also tried modifying the connection pool parameters.
With "load testing", are you able to finish one day's work in one hour? That sounds great! Or what do you mean by "load testing"?
Or are you trying to launch 200 threads in one second and they are stumbling over each other? That's to be expected. Do you really get 200 new connections in a single second? Or is it spread out?
1 million queries per day is no problem. A million queries all at once will fail.
Do not let your "load test" launch more threads than you can reasonably expect. They will all pile up, and latency will suffer while the server is giving each thread an equal chance.
Meanwhile, use the slowlog to find the "worst" queries in production. Then let's discuss the worst one or two -- Often an improved index makes that query work much faster, thereby no longer contributing to the train wreck.
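To make the "don't launch more threads than you can reasonably expect" point concrete, here is a minimal sketch of a load test with a hard cap on concurrency, using Python's standard thread pool. The names `run_load_test` and `fake_query` are made up for illustration, and `fake_query` is just a sleep standing in for your real database or HTTP call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(request_fn, total_requests, max_concurrency):
    """Call request_fn(i) total_requests times, never running more than
    max_concurrency calls at once. Returns per-request latencies in seconds."""

    def timed_call(i):
        start = time.perf_counter()
        request_fn(i)
        return time.perf_counter() - start

    # The pool size is the cap: extra requests wait in the queue
    # instead of piling onto the database all at once.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def fake_query(i):
    # Placeholder for a real query; replace with your own call.
    time.sleep(0.01)


if __name__ == "__main__":
    latencies = run_load_test(fake_query, total_requests=50, max_concurrency=5)
    print(f"{len(latencies)} requests, worst latency {max(latencies):.3f}s")
```

Start with a `max_concurrency` close to what production actually sees, then raise it step by step and watch where latency starts to climb.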

MySQL not behaving Asynchronous with Amazon RDS Instance?

I'm encountering a very strange issue with our MySQL RDS deployment. When a complex Stored procedure that can take 10+ seconds to complete is called, all other calls to the database are bogged down and hung up. This includes any call to SHOW FULL PROCESSLIST. Note the calls are from external/other sessions. For example the Stored Procedures that are taking 10-20 seconds are called by our Web Service but my attempt at executing any queries or SHOW FULL PROCESSLIST are from the IDE on my system, so a completely different connection/session.
Yet my query hangs until the other process is complete, and Amazon RDS reports just 2.3% CPU usage for MySQL.
Heck, even opening the connection to RDS while these stored procedures are running takes forever, so something is very wrong - it's as if MySQL isn't operating in any asynchronous capacity.
Any ideas what's going on here? Am I missing a single simple default flag in RDS that's turned off asynchronous processing?
The issue was the class of instance we were using with AWS; it was just too small. Once we updated it to t2.medium, the problems disappeared. The unusual thing is what we were running really was nothing intensive with the database; however, it appears the t2.micro class is really designed to not be used in any real capacity. One of the issues is price starts compounding very quickly in AWS, even for a sandbox system. A small company can quickly find fees in excess of $1,000 just by running test environments. This is not reasonable given the service and performance level provided by AWS for the cost.

Google cloud SQL - CPU at 100%

Earlier we noticed that our Master DB CPU started spiking:
There wasn't any unusual traffic volume/load. Also, if you look at the earlier spikes they coincide with the Google backups, but it looks like there wasn't one on the 19th despite it saying that it was run in the operations logs. I'm guessing that the Google backup went wrong on the server and it went out of control the next morning when it eventually ran.
I've cloned that server and moved the traffic across to the new server and now the CPU has dropped to 10-20% but this is still a lot higher than normal (1-5%)
Things that I've checked:
- Process list
- Traffic volumes
- DB/Table sizes
Any ideas on how to get to the bottom of what's causing the change, or how to fix it?
High CPU usage in a database can be caused by a bunch of different things. It might have been a wide or inefficient query, a backup process gone wrong, or a few other likely suspects.
If your app can support downtime, you could try shutting it down and restarting to get a fresh state.
If you have the support package, you can also open a ticket and ask them to look into the spike further. If you don't, you can still open an issue on the Cloud SQL issue tracker, but the response time might not be as fast.

AWS RDS MySQL performance drop after random timespan

QUESTION OUTLINE
Our AWS RDS instance starts slowing down after about 7-14 days, by a quite large factor (~400% load times for a specific set of queries). RDS monitoring shows no signs of resource shortage. (see below the question update for detailed problem description)
Question Update
So after more than one month of investigating and some developer support by AWS, I am not exactly closer to a solution.
Here are a couple of steps which I checked off the list, more or less without any further hint of the problem:
- Index / Fragmentation (all tables have correct indexes/keys and have no fragmentation)
- MySQL Stats Update (manually updating stats source)
- Thread Concurrency (changing innodb_thread_concurrency to various different values)
- Query Cache Hit Ratio doesn't show problems
- EXPLAIN to see if any SELECTs are actually slow or not using indexes/keys
- SLOW QUERY LOG (returns no results because, as described in the paragraph below, the load is a large number of prepared SELECTs)
- RDS and EC2 are within one VPC
To explain: the PlayFramework version in use (2.3.8) uses BoneCP, and we are using eBeans to select our data. Basically, I am walking through a nested object and all its child objects, which produces a couple of hundred prepared SELECTs for the API call in question. This should be fine for the hardware in use; neither CPU nor RAM is heavily used by these operations.
I also included NewRelic for more insights on this issue and did some JVM profiling. Obviously, most of the time is consumed by NETTY/eBeans?
Is anyone able to make sense of this?
ORIGINAL QUESTION: Problem Outline
Our AWS RDS instance starts slowing down after about 7-14 days, by a quite large factor (~400% load times for a specific set of queries). RDS monitoring shows no signs of resource shortage.
Infrastructure
We run a PlayFramework backend for a mobile app on AWS EC2 instances, connected to AWS RDS MySQL instances, one PROD environment, one DEV environment. Usually the PROD EC2 instance is pointing to the PROD RDS instance, and the DEV EC2 points to the DEV RDS (hi from captain obvious!); however sometimes we also let the DEV EC2 point to the PROD DB for some testing purposes. The PlayFramework in use is working with BoneCP.
Detailed Problem Description
In a quite essential sync process, our app is making a certain API call many times a day per user. I discussed the backgrounds of the functionality in this SO question, where, thanks to comments, I could nail the problem down to be a MySQL issue of some kind.
In short, the API call is loading a set of data, the maximum is about 1MB of json data, which currently takes about 18s to load. When things are running perfectly fine, this takes about 4s to load.
Curiously enough, what "solved" the problem last time was upgrading the RDS instance to another instance type (from db.m3.large to db.m4.large, which is just a very marginal step). Now, after about 2-3 weeks, the RDS instance is once again performing as slowly as before. Rebooting the RDS instance had no effect. Re-launching the EC2 instance had no effect either.
I also checked if the indices of the affected mySQL tables are set properly, which is the case. The API call itself is not eager-loading any BLOB fields or similar, I double-checked this. The CPU-usage of the RDS instances is below 1% most of the time, when I stress tested it with 100 simultaneous API calls, it went to ~5%, so this is not the bottleneck. Memory is fine too, so I guess the RDS instance doesn't start swapping which could slow down the whole process.
As hard evidence: a (smaller) public API call on the DEV environment currently takes 2.30s to load, while on the PROD environment it takes 4.86s. This is interesting, because the DEV environment has a much smaller instance type for both EC2 and RDS. So basically the turtle wins the race here. (If you are interested in this API call I am happy to share it with you via PM, but I don't really want to post links to API calls, even if they are basically public.)
Conclusion
In conclusion, it feels (I wittingly say 'feels') like the DB is clogged after x days of usage or after a certain number of API calls. I'm not sure if this is an RDS-specific issue; once I 'largely' reset the DB instance by changing the instance type, things run fast and smooth. But re-creating my DB instance from a snapshot every 2 weeks is not an option, especially if I don't understand why this is happening.
Do you have any ideas what further steps I could take to investigate this matter?
(Too long for just a comment) I know you have checked a lot of things, but I would like to look at them with a different set of eyes...
Please provide
- SHOW VARIABLES; (probably need post.it or something, due to size)
- SHOW GLOBAL STATUS;
- How much RAM? Sounds like 7.5G
- The query. -- Unclear what query/queries you are using
- SHOW CREATE TABLE for the table(s) in the query -- indexes, datatypes, etc
(Some of the above may help with "clogging over time" question.)
Meanwhile, here are some guesses/questions/etc...
- Some other customer sharing the hardware is busy.
- It could be a network problem?
- Shrink long_query_time to 1 so you can catch slow queries.
- When are backups done on your instance?
- 4s-18s to load a megabyte -- what percentage of that is SQL statements?
- Do you "batch" the inserts? Is it a single transaction? Are lengthy queries going on at the same time?
- What, if any, MySQL tunables did you change from the AWS defaults?
- 6GB buffer_pool on a 7.5GB partition? That sounds dangerously tight. Can you see if there was any swapping?
- Any PARTITIONing involved? (Of course the CREATE will answer that.)
There is one very important bit of information missing from your description: the total allocated space for the database. Baseline I/O for RDS is around 3 IOPS per GB of allocated space, so for a 100GB allocation, you should get around 300 IOPS. That allocated space also includes logs.
Since you don't really know what's going on, the first step should be to turn on detailed monitoring, which will give you more idea of what is happening on the instance.
Until you have additional stats gathered during a slowdown, you can try increasing the allocated space, which will increase the IOPS available.
Also, check the events for the db - are logs getting purged on a regular basis? That might indicate that there's not enough space.
Finally, you can try going with PIOPS (provisioned IOPS) if you have an idea of what the application needs, though at this point it sounds like that would be a guess.
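The 3x rule of thumb above is easy to turn into a quick calculation. This is only a sketch: the function names are made up, the 3 IOPS per GB ratio is the approximate figure quoted in this answer, and real volumes have minimum and maximum limits that this ignores.

```python
def baseline_iops(allocated_gb: float, iops_per_gb: float = 3.0) -> float:
    """Approximate baseline IOPS for a given storage allocation,
    using the ~3 IOPS per GB rule of thumb."""
    return allocated_gb * iops_per_gb


def storage_needed_gb(target_iops: float, iops_per_gb: float = 3.0) -> float:
    """Allocation needed to reach a target baseline IOPS under the same rule."""
    return target_iops / iops_per_gb


if __name__ == "__main__":
    print(baseline_iops(100))      # the 100GB example from the answer
    print(storage_needed_gb(900))  # allocation for a 900 IOPS baseline
```

If the detailed monitoring shows sustained read/write IOPS sitting near the number this gives for your allocation, storage is the likely bottleneck rather than CPU or RAM.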
Maybe your burst credit balance is (slowly) being depleted? Eventually you end up with baseline performance, which may appear "too slow".
This would also explain why the upgrade to another instance type helped, as you then start with a full burst balance again.
I would suggest increasing the size of the volume, even if you don't need the extra space, as the baseline performance grows linearly with volume size.
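To see how a slow drain could take one to two weeks, here is a rough model of a burst bucket. The numbers are assumptions on my part: a starting balance of 5.4 million I/O credits and a baseline of 3 IOPS per GB are the figures I recall from the general purpose SSD (gp2) documentation, so check them against the current docs. The model also ignores the minimum baseline and any refill during quiet periods.

```python
def hours_until_depleted(
    volume_gb: float,
    workload_iops: float,
    credit_balance: float = 5_400_000,  # assumed starting balance
    iops_per_gb: float = 3.0,           # assumed baseline ratio
) -> float:
    """Hours until the burst bucket is empty under a steady workload.
    Returns infinity if the workload never exceeds the baseline."""
    baseline = volume_gb * iops_per_gb
    drain_per_second = workload_iops - baseline
    if drain_per_second <= 0:
        return float("inf")
    return credit_balance / drain_per_second / 3600


if __name__ == "__main__":
    # A 100GB volume with a steady load only slightly above its baseline.
    hours = hours_until_depleted(volume_gb=100, workload_iops=305)
    print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
```

Under these assumptions, a workload just 5 IOPS above baseline empties the bucket in about 12.5 days, which is in the same range as the 7-14 days reported in the question.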

Google VM Instance becomes unhealthy on its own

I have been using Google Cloud for quite some time and everything has worked fine. I was using a single VM instance to host both the website and the MySQL database.
Recently, I decided to move the website to autoscaling so that the website doesn't go down on days when traffic increases.
So I moved the database to Cloud SQL and created a VM group which hosts the PHP, HTML, and image files. Then I set up a load balancer to divert traffic to the various VM instances in the VM group.
The problem is that the backend service (the VM group inside the load balancer) becomes unhealthy on its own after working fine for 5-6 hours, and then becomes healthy again after 10-15 minutes. I have also seen that the problem can occur when I run a lengthy file with many MySQL queries.
I checked the health check and it was returning a 200 response. During the 10-15 minutes of downtime, the VM instance is accessible from its own IP address.
Everything else is the same; I have just added a load balancer in front of the VM instance, and the problem started.
Can anybody help me troubleshoot this problem?
It sounds like your server is timing out (blocking?) on the health check during the times the load balancer reports it as down. A few things you can check:
The logs (I'm presuming you're using Apache?) should include a duration along with the request status. The default health check timeout is 5s, so if your health check is returning a 200 in 6s, the health checker will time out after 5s and treat the host as down.
You mention that a heavy MySQL load can cause the problem. Have you looked at disk I/O statistics and CPU to make sure that this isn't a load-related problem? If this is CPU or load related, you might look at increasing either CPU or disk size, or moving your disk from spindle-backed to SSD-backed storage.
Have you checked that you have sufficient threads available? Ideally, your health check would run fairly quickly, but it might be delayed (for example) if you have 3 threads and all three are busy running some other PHP script that's waiting on the database.
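The timeout point is the one that most often surprises people, so here is a small sketch of it. It is a simplified model, not the load balancer's actual logic: the function names are made up, the samples are invented (status, duration) pairs such as you might pull from your access log, and the 5s default is the figure mentioned above.

```python
def health_check_passes(status_code: int, duration_s: float, timeout_s: float = 5.0) -> bool:
    """A probe only counts as healthy if it returns 200 within the timeout.
    A 200 that arrives too late looks like a failure to the checker."""
    return status_code == 200 and duration_s <= timeout_s


def slow_health_checks(samples, timeout_s: float = 5.0):
    """Return the (status, duration) samples the checker would have given up on."""
    return [(status, duration) for status, duration in samples if duration > timeout_s]


if __name__ == "__main__":
    # Invented sample data: all return 200, but one takes 6 seconds.
    samples = [(200, 0.2), (200, 6.0), (200, 4.9)]
    for status, duration in samples:
        print(status, duration, health_check_passes(status, duration))
    print("timed out:", slow_health_checks(samples))
```

If your logs show 200 responses with durations above the timeout during the unhealthy windows, that would explain why the health check "was returning 200" while the backend was still marked as down.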