.NET Core API stalls under high load - MySQL

I'm currently in the first phase of optimizing a gaming back-end. I'm using .NET Core 2.0, EF and MySQL. Currently all of this is running on a local machine, used for dev. To do initial load testing, I've written a small console app that simulates the way the final client will use the API. The API is hosted under IIS 8.5 on a Windows Server 2012R2 machine. The simulating app is run on 1-2 separate machines.
So, this all works very well up to around 100-120 requests/s. The CPU load on the server is around 15-30%, and the number of connections on the MySQL server averages around 100 (I've set max_connections to 400, and it never gets near that value). Response times average well below 100ms. However, as soon as we push the request rate a bit higher than that, the system seems to stall completely at intervals. The CPU load drops to < 5%, and at the same time the response times skyrocket. So it acts like a traffic-jam situation. During a stall, both the MySQL and dotnet processes seem to be "resting".
I do realize I'm nowhere near a production setup on anything; the MySQL instance is a dev instance, etc. However, I'm still curious what could be the cause of this. Any ideas?
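For reference, the simulator is essentially a small console app that fires concurrent HTTP requests and times the responses; a minimal sketch of that idea (the endpoint URL, concurrency and request counts are illustrative, not the actual test client):

using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class LoadSimulator
{
    // Illustrative values only; the real client mix and endpoint differ.
    private const string Endpoint = "http://localhost/api/player/state";
    private const int Concurrency = 150;        // requests in flight at any time
    private const int RequestsPerWorker = 100;

    private static readonly HttpClient Http = new HttpClient();

    static void Main()
    {
        var total = Stopwatch.StartNew();

        // Start N workers that each issue sequential requests, giving
        // roughly Concurrency requests in flight at any moment.
        var workers = Enumerable.Range(0, Concurrency)
            .Select(_ => RunWorkerAsync())
            .ToArray();

        Task.WaitAll(workers);
        Console.WriteLine($"Done: {Concurrency * RequestsPerWorker} requests in {total.Elapsed}");
    }

    private static async Task RunWorkerAsync()
    {
        for (var i = 0; i < RequestsPerWorker; i++)
        {
            var sw = Stopwatch.StartNew();
            using (var response = await Http.GetAsync(Endpoint))
            {
                Console.WriteLine($"{(int)response.StatusCode} in {sw.ElapsedMilliseconds} ms");
            }
        }
    }
}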

Related

Identifying the reason the same ASP VBScript code runs at two different speeds on two separate machines

I was hoping for some guidance in this respect. I have the same set of ASP VBScript code running on two separate machines. The new machine has a better CPU (and more cores), 10 times the amount of RAM and an SSD hard drive (whereas the original was a standard Western Digital drive). Both machines have the same OS running IIS and MySQL.
However, where the initial machine (which is 5 years old) will complete the processing of the files (200 files), with multiple file reads and multiple database deletes, inserts and selects, in under 2 hours, the second (far gutsier) machine takes 5 hours. Both machines are running identical MySQL, IIS, Python and ASP code.
The CPU on the new machine is not burdened (sits idle at 2%), the RAM is not over utilized in any way (under 10% utilization). The code runs in series and doesn't run in parallel threads.
I was hoping for some guidance on where to investigate the cause, without having to rewrite multiple lines of code for better efficiency just to squeeze out a second here and a second there. The resources on the new machine are not being utilized at all, but I am feeling rather lost in the dark as to how to navigate this, and hours of Google searches haven't yielded many positive solutions.

ServiceStack Register web service slow performance

We noticed some performance bottlenecks in ServiceStack web services, especially the ones that come out of the box, like the (Register) web service.
We ran a load test using Visual Studio Load Test with the following parameters:
1K concurrent users.
1-minute duration.
5-second think time between test iterations.
5-second sample rate.
The results are so bad that they are actually preventing us from going live with a customer:
19 seconds average response time.
Environment specs are:
2 web front ends (IIS) hosted in the AWS Europe region on c4.8xlarge EC2 instances (16 GB RAM & 8 vCPU) behind a publicly exposed load balancer.
MySQL database hosted in AWS RDS running on a db.m4.4xlarge instance (64 GB RAM, high network traffic, 16 vCPU).
We don't have special code or special global request filters, only the default configuration. We even tried connection pooling, but that didn't help much.
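For context, pooling in this stack is controlled by the ADO.NET connection string rather than by ServiceStack itself; a minimal sketch of the kind of OrmLite registration involved (the connection-string values are illustrative, not our real settings):

using Funq;
using ServiceStack;
using ServiceStack.Data;
using ServiceStack.OrmLite;

public class AppHost : AppHostBase
{
    public AppHost() : base("Auth API", typeof(AppHost).Assembly) { }

    public override void Configure(Container container)
    {
        // Pool sizing lives in the MySQL connector's connection string;
        // option names can vary slightly between connector versions.
        var connectionString =
            "Server=db.internal;Database=auth;Uid=app;Pwd=secret;" +
            "Pooling=true;Min Pool Size=10;Max Pool Size=200;";

        container.Register<IDbConnectionFactory>(c =>
            new OrmLiteConnectionFactory(connectionString, MySqlDialect.Provider));
    }
}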
What would be the reason behind such slow performance? Appreciate your support, as we are at a point where the customer is questioning the ServiceStack framework itself, and we are starting to have doubts about that as well, even though we loved every aspect of it.
We got to the bottom of this. After hours of debugging and profiling the register/login web services, we found that the register code executes duplicate queries against the db (the check-existing-user validation logic, etc.); that was even highlighted by Mini Profiler, but it still wasn't the reason for failing at only 1K concurrent users, which is a very low number compared to the environment specs we ran on.
The real reason was the following code being called in both register and login:
private static TUserAuth GetUserAuthByUserName(IDbConnection db, string userNameOrEmail)
{
    // Anything containing "@" is treated as an email address.
    var isEmail = userNameOrEmail.Contains("@");

    // Both branches call .ToLower() on the column, which OrmLite translates
    // into SQL LOWER() applied to every row.
    var userAuth = isEmail
        ? db.Select<TUserAuth>(q => q.Email.ToLower() == userNameOrEmail.ToLower()).FirstOrDefault()
        : db.Select<TUserAuth>(q => q.UserName.ToLower() == userNameOrEmail.ToLower()).FirstOrDefault();

    return userAuth;
}
The calls to .ToLower() were translated into the SQL LOWER() function. Applying LOWER() to the column prevents MySQL from using an index, so every lookup becomes a full table scan; run concurrently against a table with hundreds of thousands of rows, that caused huge CPU spikes on the db server and all of the bottlenecks.
The fix was as simple as adding dedicated lower-cased username and email columns to the database, updating the UserAuth POCO to reflect them, adding an index on the new columns, and adjusting the OrmLite where condition to use them.
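A rough sketch of the shape of that fix (class, property and helper names here are illustrative, not the exact code we shipped):

using System.Data;
using System.Linq;
using ServiceStack.Auth;
using ServiceStack.DataAnnotations;
using ServiceStack.OrmLite;

public class AppUserAuth : UserAuth
{
    // Dedicated, pre-lowered copies of the values we search on,
    // written whenever UserName/Email are saved.
    [Index]
    public string UserNameLower { get; set; }

    [Index]
    public string EmailLower { get; set; }
}

static class UserAuthQueries
{
    public static AppUserAuth GetUserAuthByUserName(IDbConnection db, string userNameOrEmail)
    {
        // Lower-case the parameter once, then compare it against the
        // already-lowered, indexed columns; no LOWER() is applied to the
        // column, so MySQL can use the index.
        var value = userNameOrEmail.ToLower();
        var isEmail = userNameOrEmail.Contains("@");

        return isEmail
            ? db.Select<AppUserAuth>(q => q.EmailLower == value).FirstOrDefault()
            : db.Select<AppUserAuth>(q => q.UserNameLower == value).FirstOrDefault();
    }
}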

Google VM Instance becomes unhealthy on its own

I have been using Google Cloud for quite some time and everything has worked fine. I was using a single VM instance to host both the website and the MySQL database.
Recently, I decided to move the website to autoscaling so that on days when the traffic increases, the website doesn't go down.
So, I moved the database to Cloud SQL and created a VM group which hosts the PHP, HTML and image files. Then, I set up a load balancer to distribute traffic across the VM instances in the VM group.
The problem is that the backend service (the VM group behind the load balancer) becomes unhealthy on its own after working fine for 5-6 hours, and then becomes healthy again after 10-15 minutes. I have also seen that the problem can appear when I run a script that takes a while and issues many MySQL queries.
I checked the health check and it was returning a 200 response. During the down period of 10-15 minutes, the VM instance is still accessible via its own IP address.
Everything else is the same; I have just added a load balancer in front of the VM instance, and the problem has started.
Can anybody help me troubleshoot this problem?
It sounds like your server is timing out (blocking?) on the health check during the times the load balancer reports it as down. A few things you can check:
The logs (I'm presuming you're using Apache?) should include a duration along with the request status. The default health check timeout is 5s, so if your health check is returning a 200 in 6s, the health checker will time out after 5s and treat the host as down.
You mention that a heavy mysql load can cause the problem. Have you looked at disk I/O statistics and CPU to make sure that this isn't a load-related problem? If this is CPU or load related, you might look at increasing either CPU or disk size, or moving your disk from spindle-backed to SSD-backed storage.
Have you checked that you have sufficient threads available? Ideally, your health check would run fairly quickly, but it might be delayed (for example) if you have 3 threads and all three are busy running some other PHP script that's waiting on the database.

How to find out what is causing a slow down of the application?

This is not the typical question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.
Situation
We have this web application that uses Zend Framework, so it runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
The application has a very unique usage and load pattern. It is a mobile web application where, every full hour, a cronjob looks through the database for users that have some information waiting or an action to take and sends this information to an external notification server, which pushes these notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.
Problem
In the last few weeks, usage of the application really started to grow. In the last few days we have encountered very high load and a doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests; it just gets slower and slower, and often takes 20 minutes to recover, until the same thing starts again at the full hour.
We have extensive monitoring in place (New Relic, collectd), but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16 core Intel Xeon (8 cores with hyperthreading, I think) and 12GB RAM running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is Version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs (GIF images in the original post):
info: phpinfo(), APC status, memcache status
collectd: processes, CPU, Apache, load, MySQL, vmem, disk
New Relic: application performance, server overview, processes, network, disks
(Sorry the graphs are gifs and not the same time period, but I think the most important info is in there)
The problem is almost certainly MySQL-based. If you look at the final graph, mysql/mysql_threads, you can see the number of threads hits 200 (which I assume is your setting for max_connections) at 20:00. Once max_connections has been hit, things do tend to take a while to recover.
Using mtop to monitor MySQL just before the hour will really help you figure out what is going on, but if you cannot install it you could just use SHOW PROCESSLIST;. You will need to establish your connection to MySQL before the problem hits. You will probably see lots of queued processes with only 1 process currently executing. That will be the most likely culprit.
Having identified the query causing the problems, you can attack your code. Without understanding how your application actually works, my best guess would be that wrapping the problem query (or queries) in an explicit transaction will probably solve the problem.
Good luck!

Service deployed on Tomcat crashing under heavy load

I'm having trouble with a web service deployed on Tomcat. During peak traffic times the server becomes unresponsive and forces me to restart the entire server in order to get it working again.
First of all, I'm pretty new to all this. I built the server myself using various guides and blogs. Everything has been working great, but due to the larger load of traffic, I'm now getting out of my league a little. So, I need clear instructions on what to do or to be pointed towards exactly what I need to read up on.
I'm currently monitoring the service using JavaMelody, so I can see the spikes occurring, but I am unaware how to get more detailed information than this as to possible causes/solutions.
The server itself is quad-core with 16 GB RAM, so the issue doesn't lie there; more likely it is that I need to properly configure Tomcat to be able to use this (or set up a cluster...?).
JavaMelody shows the service crashing when the CPU usage only gets to about 20%, at about 300 hits a minute. Are there any max connection limits or memory settings that I should be configuring?
I also only have a single instance of the service deployed. I understand I can simply rename the war file and Tomcat deploys a second instance. Will doing this help?
Each request also opens (and immediately closes) a connection to MySQL to retrieve data, so I probably need to make sure it's not getting throttled there either.
Sorry this is so long winded and has multiple questions. I can give more information as needed, I am just not certain what needs to be given at this time!
The server has 16 GB of RAM, but how much memory do you have dedicated to Tomcat (-Xms and -Xmx)?