We are running MySQL 5.6 on Windows Server 2008 R2.
Every 30 minutes it runs very slowly for around 40 seconds and then goes back to normal for another 30 minutes. It is happening like clockwork with each ‘hang’ being 30 minutes after the last one finished.
Any ideas? We are stumped and don’t know where next to look.
Background / things we have ruled out below.
Thanks.
• Our initial thought was a locking query, but we have eliminated this.
• The slow query log shows affected queries but with zero lock time.
• General logs show nothing. (As an aside, is there a way to increase the logging level so that it logs when it is flushing caches etc.? What does MySQL run every 30 minutes? See the diagnostic sketch after this list.)
• When it is running slowly, it is still running, but even simple queries like SELECT 'Hello World'; take over a second to run.
• All MySQL operations run slowly at the time in question including monitoring tools and especially making new connections. InnoDB and MyISAM are equally affected.
• We have switched from using the SAN array to using local SSD and it has made no difference, ruling out disk/spindle contention.
• The machine has Sophos Endpoint Protection but this is not scanning anything on the database drives.
• It is as if the machine is maxed out, but local performance monitoring does not show any unusual system metrics: CPU, disk queue, disk throughput, memory, network activity, etc. are all flat.
• The machine is a VM running on VMware. Hypervisor monitoring is not showing any performance issues, but I am not convinced it is granular enough to pick up a spike that short.
• We have tried adjusting MySQL settings such as the InnoDB buffer pool size, log file size, etc., and this has made no difference.
• The server runs nothing other than a couple of MySQL instances.
• The other instances are unaffected - as far as we can tell.
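A rough diagnostic sketch (assuming a second connection can be opened while a stall is in progress) of the kind of server-side checks that can catch a scheduled job or an internal flush in the act:
-- Is anything scheduled inside MySQL itself?
SHOW GLOBAL VARIABLES LIKE 'event_scheduler';
SELECT event_schema, event_name, status, interval_value, interval_field FROM information_schema.events;
-- Run these during a stall and compare with a quiet period:
SHOW FULL PROCESSLIST;                                      -- what every thread is doing or waiting on
SHOW ENGINE INNODB STATUS\G                                 -- pending I/O, semaphore waits, checkpoint activity
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';   -- a sharp drop during the stall suggests aggressive flushing
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_pending%';           -- pending redo log writes/fsyncs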
There's some decent advice here on Server Fault:
https://serverfault.com/questions/733590/mysql-stops-responding-periodically
Have you monitored disk I/O? Is there an increase in I/O wait times or queued transactions? It's possible that requests are queueing up at the storage level due to an I/O limitation put on by your host. Also, have you checked if you're hitting your max allowable MySQL clients? If these queries are suddenly taking a lot longer to complete, it's also possible that it's not leaving enough available connections for normal site traffic because the other connections aren't closing fast enough.
I'd recommend using iostat to see if you're saturating your disks. It should show whether your disks are at 100% utilization, etc.
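To check the "max allowable clients" angle from the server side, a minimal sketch (the exact thresholds to worry about depend on your workload):
SHOW VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';   -- high-water mark since startup
SHOW GLOBAL STATUS LIKE 'Threads_connected';      -- connections open right now
SHOW GLOBAL STATUS LIKE 'Aborted_connects';       -- failed connection attempts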
Related
I have a db.r4.4xlarge MariaDB RDS instance on AWS (it should be able to handle 10,000 connections per AWS's metrics). Today we saw a big spike in database connections: around 1,000 simultaneous and sustained connections. We have so many connections because our clients' usage has grown significantly after an acquisition they were part of. CPU and memory were OK, i.e., not pinned, and no swapping to my knowledge. All queries, even simple ones, started to take forever. Even something like
CREATE TEMPORARY TABLE if not exists TTT0dfa3c18b1036a73dd5d5581bddac484_2 LIKE TTT0dfa3c18b1036a73dd5d5581bddac484;
was taking > 5 seconds, and that's not even loading any table data, just the structure. I'm trying to get to the root of the issue and find a solution. Can anyone point me in a direction that may lead to a better understanding of what was going on? The crux of my confusion is that CPU and memory looked OK while all this was happening; only the connection count stood out as unusually high. Should I try using EC2 with SSD instead? Any ideas or thoughts would be helpful, as I am not experienced with this type of issue.
Below I've added a couple of pics of the only metrics that stood out at all: connections, binlog usage, and freeable memory. I/O and CPU were fine.
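A rough sketch of the server-side numbers worth capturing the next time this happens (assuming you can still get a session in), since a high connection count often shows up as thread contention rather than CPU or memory pressure:
SHOW GLOBAL STATUS LIKE 'Threads_connected';      -- clients attached
SHOW GLOBAL STATUS LIKE 'Threads_running';        -- statements actually executing right now
SHOW GLOBAL STATUS LIKE 'Threads_created';        -- climbing fast suggests the thread cache is too small
SHOW VARIABLES LIKE 'thread_cache_size';
SHOW VARIABLES LIKE 'innodb_thread_concurrency';
SHOW FULL PROCESSLIST;                            -- what the busy threads are waiting on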
We are currently working with a 200 GB database and we are running out of space, so we would like to increase the allocated storage.
We are using General Purpose (SSD) and a MySQL 5.5.53 database (without Multi-AZ deployment).
If I go to the Amazon RDS menu and change the Allocated storage to a bit more (from 200 to 500) I get the following "warnings":
Deplete the initial General Purpose (SSD) I/O credits, leading to longer conversion times: What does this mean?
Impact instance performance until operation completes: And this is the most important question for me. Can I resize the instance with zero downtime? I mean, I don't care if the queries are a bit slower as long as they keep working while it's resizing, but what I don't want to do is stop all my production websites, resize the instance, and open them again (i.e. have downtime).
Thanks in advance.
You can expect degraded performance but you should really test the impact in a dev environment before running this on production so you're not caught off guard. If you perform this operation during off-peak hours you should be fine though.
To answer your questions:
RDS instances can burst using something called I/O credits. Bursting means performance can go above the baseline to meet spikes in demand. Burning through your credits shouldn't be a big deal unless your instance relies on them (you can determine this from the RDS instance metrics). Have a read through I/O Credits and Burst Performance.
Changing the disk size will not result in a complete RDS instance outage, just performance degradation, so it's better to do it during off-peak hours to minimise the impact as much as possible.
First, according to the RDS FAQs, there should be no downtime at all as long as you are only increasing the storage size and not upgrading the instance class.
Q: Will my DB instance remain available during scaling?
The storage capacity allocated to your DB Instance can be increased while maintaining DB Instance availability.
Second, according to RDS documentation:
Baseline I/O performance for General Purpose SSD storage is 3 IOPS for each GiB, which means that larger volumes have better performance... Volumes below 1 TiB in size also have the ability to burst to 3,000 IOPS for extended periods of time (burst is not relevant for volumes above 1 TiB). The instance's I/O credit balance determines burst performance.
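In concrete terms for this case: at 3 IOPS per GiB, 200 GB of General Purpose storage has a baseline of roughly 600 IOPS and 500 GB roughly 1,500 IOPS; both are well under 1 TiB, so the volume can still burst to 3,000 IOPS while I/O credits last.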
I cannot say for certain why, but I guess that when RDS increases the disk size it may defragment or rearrange data blocks, which causes heavy I/O. If your server is under heavy usage during the resizing, that may fully consume the I/O credits and result in less I/O headroom and longer conversion times. However, given that you started with 200 GB, I suppose it should be fine.
Finally, I would suggest you use a Multi-AZ deployment if you are worried about downtime or performance impact. During maintenance windows or snapshots there will be a brief I/O suspension of a few seconds, which can be avoided with a standby or read replicas.
The technical answer is that AWS supports no downtime when scaling storage.
However, in the real world you need to factor in how busy your current database is and how the "slowdown" will affect users. Consider the possibility that connections might time out or that the site may appear slower than usual for the duration of the scaling event.
In my experience, RDS storage resizing has been smooth and problem-free. However, we pick the least busy time of day to implement it, and we also go through a backup procedure: we take a snapshot and bring up a standby server that we can switch over to manually, just in case.
I have the following problem.
Using REST, I am getting binary content (BLOBs) from a MySQL database via a NodeJS Express app.
All works fine, but I am having issues scaling the solution.
I increased the number of NodeJS instances to 3; they are running on ports 4000, 4001, and 4002.
On the same machine I have Nginx installed and configured to load balance across my 3 instances.
I am using Apache Bench to do some perf testing.
Please see attached pic.
Assuming I have a dummy GET REST endpoint that goes to the db, reads the blob (roughly 600 KB in size), and returns it (all over HTTP), I am making 300 simultaneous calls. I would have thought that using Nginx to distribute the requests would make it faster, but it does not.
Why is this happening?
I am assuming it has to do with MySQL?
My NodeJS app is using a connection pool with a limit of 100 connections. What should the relation be between this value and the max_connections value in MySQL? If I increase the connection pool limit, I get worse results.
Any suggestion on how to scale?
Thanks!
"300 simultaneous" is folly. No one (today) has the resources to effectively do more than a few dozen of anything.
4 CPU cores -- If you go much beyond 4 threads, they will be stumbling over each other trying to get CPU time.
1 network -- Have you checked whether your big blobs are using all the bandwidth, thereby being the bottleneck?
1 I/O channel -- Again, lots of data could be filling up the pathway to disk.
(This math is not quite right, but it makes a point...) You cannot effectively run any faster than what you can get from 4+1+1 "simultaneous" connections. (In reality, you may be able to, but not 300!)
The typical benchmarks try to find how many "connections" (or whatever) leads to the system keeling over. Those hard-to-read screenshots say about 7 per second is the limit.
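As a rough sanity check on the bandwidth point: about 7 requests per second at roughly 600 KB per response is around 4 MB/s, or roughly 34 Mbit/s of response traffic alone; whether that matters depends on the link (it is a large share of a 100 Mbit connection, a small share of gigabit).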
I also quibble with the word "simultaneous". The only thing close to "simultaneous" (in your system) is the ability to use 4 cores "simultaneously". Every other metric involves sharing of resources. Based on what you say, ...
If you start about 7 each second, some resource will be topped out, but each request will be fast (perhaps less than a second).
If you start 300 all at once, they will stumble over each other, some of them taking perhaps minutes to finish.
There are two interesting metrics:
How many per second you can sustain. (Perhaps 7/sec)
How long the average (and, perhaps, the 95th percentile) takes.
Try 10 "simultaneous" connections and report back. Try 7. Try some other small numbers like those.
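If it helps, a minimal way to watch the server side while Apache Bench runs (assuming you can keep one extra MySQL session open for it):
SHOW VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';   -- did the pool ever actually need 100?
SHOW GLOBAL STATUS LIKE 'Threads_running';        -- re-run during the test; far above the core count means requests are just queuing
SHOW GLOBAL STATUS LIKE 'Threads_connected';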
I've got a query that is running 5x slower on my staging server as opposed to my local dev machine.
Stack Overflow doesn't want to play nicely with the formatting; the query, describes, and explains are located here.
Looking at the describe statements, I can't see any difference between the local and remote schemas.
The record counts for the 2 machines are in the same order of magnitude (500k vs 600k)
Edit In Response to Comments
It was my highly unscientific approach of throwing the queries into MySQL Workbench and looking at the query time. The local query time was on the order of 1.3 seconds and the remote query time was on the order of 5.2 seconds (so it's 4x as slow). I'm sure there's a better way to test this query time.
The machines are different. My dev machine is a MacBook Pro with 8 GB of RAM. The staging server is a Linode VPS with 512 MB of RAM. There shouldn't be much load on the staging server (I'm the only one who uses it). I've noticed most queries run in approximately the same time frame on the local machine and the staging server, so I was confused as to why this one had such a drastically different time frame.
RAM Issue
Since a temporary table isn't being used (no mention in the EXPLAINS), is the amount of RAM still an issue?
Output from free
             total       used       free     shared    buffers     cached
Mem:        508576     453880      54696          0       4428     254200
-/+ buffers/cache:     195252     313324
Swap:       262140      19500     242640
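Reading that output: roughly 195 MB is used by processes and about 313 MB is free once buffers/cache are counted, with around 19 MB of swap in use, so the box is not desperately short of memory but has far less cache headroom than an 8 GB laptop.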
Profiling Added to Gist
It looks like the remote is taking 2.5 seconds in 'sending data' whereas the local is only taking 0.5 seconds. Is this an I/O issue? (Complete profiling info in the gist.)
Your staging server has one sixteenth of the RAM that your MacBook Pro has.
Without knowing how much RAM is available to your two instances of MySQL, it's hard to be definitive, but that's the first place I'd look.
Also, if you run these queries from the MySQL command line, locally, how do the times compare?
It could be that the increase in time is in network transfer and not query processing.
Actually... network transfer time is the first place I'd look... then MySQL memory usage.
EDIT following question updates
The 'sending data' phase is the phase where the server is sending data to the client (ref). I don't know exactly how large your data set is, but 2.5 s seems pretty high for what's probably 50 kB of data or so.
Having looked at the profiling data, nearly all the time is spent sending data, so I'd strongly suspect the network here.
EDIT 2
Some research led me to this page, which indicates that 'Sending data' is misleading and that this is actually the time spent executing your query.
Thus, I really think you need to be looking at CPU and memory usage on your server, since it's specced at a level so much lower than your MacBook.
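One hedged way to see whether the 512 MB box is going to disk for data the MacBook serves from memory (assuming the tables are InnoDB) is to compare the buffer pool size with its read counters on both machines:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- logical reads
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';          -- reads that had to hit disk
-- A high ratio of Innodb_buffer_pool_reads to read requests on the VPS (but not locally)
-- would point at the RAM difference rather than the query itself.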
My site started dragging lately, with queries taking exceptionally longer than I would expect with properly tuned indexes. I just restarted the MySQL server after 31 days of uptime, and every query is now substantially faster and the whole site renders 3-4 times faster.
Would there be anything that jumps out at you as to why this may have been? Improper settings in my.cnf perhaps? Any ideas as to what I can start looking at to try and pinpoint the cause?
thanks
Updated note: I have a 16 GB dedicated DB box and MySQL sits at about 71% of memory after a week or so.
Try executing show processlist; maybe there are some long-running threads that were not killed for some reason.
Similarly, execute SHOW SESSION STATUS LIKE 'Created%'; to check whether MySQL has created too many temporary tables.
A server restart automatically closes all open temporary tables and kills threads, so the application might run quicker afterwards.
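A related check (a small sketch; the acceptable numbers are workload-dependent) is whether those temporary tables are spilling to disk:
SHOW GLOBAL STATUS LIKE 'Created_tmp_tables';        -- in-memory temporary tables since startup
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';   -- the ones that spilled to disk
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
-- A large, fast-growing Created_tmp_disk_tables count suggests raising those limits
-- or rewriting the offending queries.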
Do you have temporary table(s) that might not be getting cleared/collected?
I would suggest using MySQL Enterprise Monitor for analysis purposes. It comes with a 30-day trial; we just used it and got alerts such as:
CRITICAL Alert - Table Scans Excessive
The target server does not appear to be using indexes efficiently.
CRITICAL Alert - Connection Usage Excessive
CRITICAL Alert - CPU Usage Excessive
WARNING Alert - MyISAM Key Cache Has Sub-Optimal Hit Rate
Just something to explore!
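If installing the trial isn't an option, rough manual equivalents of those checks (a sketch using standard status counters; what counts as "excessive" is a judgment call):
-- Excessive table scans
SHOW GLOBAL STATUS LIKE 'Select_scan';
SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next';
-- Connection usage
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
-- MyISAM key cache hit rate = 1 - (Key_reads / Key_read_requests)
SHOW GLOBAL STATUS LIKE 'Key_read%';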