Azure "MySQL server has gone away" for one minute only - mysql

I'm using Azure App Services to run about 15 PHP web apps. Most of these apps connect to my 'Azure Database for MySQL server' instance. This is a Basic-tier instance (1 vCore & 2GB memory).
The MySQL instance hosts about 30 small databases (ranging from 1 to 100 MB in size).
The load on the MySQL instance is stable and low. CPU is constantly under 20%, memory is constantly under 50% and IO does not even show up in the metrics in the Azure Portal.
My problem is this:
Every once in a while the server goes offline for about one or two minutes (five at most). I see that client applications try to connect, hang for a while, and finally get the error:
SQLSTATE[HY000] [2006] MySQL server has gone away
It seems to happen randomly: sometimes a few times a week, or even a few times a day, but sometimes it doesn't happen for weeks.
What's noticeable, though, is that when it happens I see a downward spike in memory and an upward spike in CPU in the metrics graph on the portal.
Does anyone experience the same issue on Azure Database for MySQL? And did anyone find a solution?
I'm starting to think that it's caused by resource movement on the Azure side, but I don't have any evidence to back that up. If so, shouldn't that happen without any downtime?
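One way to pin down exactly when the server becomes unreachable (and line the outages up against the portal metrics) is a probe loop like the sketch below; it assumes the mysqladmin client is installed, and the host and user are placeholders for your own server:

#!/usr/bin/env bash
# Probe the server every 10 seconds and log each result with a UTC timestamp.
# Host and user are placeholders; the password is read from the MYSQL_PWD
# environment variable, which the mysql clients pick up automatically.
HOST="myserver.mysql.database.azure.com"
USER="myadmin@myserver"
while true; do
  if mysqladmin --host="$HOST" --user="$USER" --connect-timeout=5 ping >/dev/null 2>&1; then
    echo "$(date -u +%FT%TZ) up" >> mysql_probe.log
  else
    echo "$(date -u +%FT%TZ) DOWN" >> mysql_probe.log
  fi
  sleep 10
done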

Scaling up from the Basic tier with 1 vCore on Compute Gen 4 to the Basic tier with 2 vCores on Compute Gen 5 seemed to resolve the problem.
I'm not sure, though, what was causing the issue.

I started experiencing this error in May 2019.
If I happen to be connected to the mariadb server over ssh when it occurs, with htop running, I can see rsyslog suddenly going crazy. It bogs down the CPU and the network connection becomes unresponsive. The CPU and network activity doesn't show up in Azure, but running w in the ssh session after the network recovers shows that the CPU load was definitely very high during the last 15 minutes.
I traced it back to the OMS agent. When that service is killed on the mariadb server, the server runs without any problem. As soon as the OMS agent is started, "MySQL server has gone away" pops up on the clients within 24 hours because the network connection to the server machine becomes unresponsive.
It is possible to uninstall the OMS agent from the Azure portal, but it comes back within 48 hours.
The only way I found of getting rid of the OMS agent was to also stop walinuxagent on the Linux server.
Scaling the server up may solve the problem, since you get more CPU power to absorb the extra load induced by the OMS agent, but I prefer to kill the OMS agent and walinuxagent rather than spend more money on a more expensive server.
Edit:
It turns out OMS is installed because the VM is part of a Log Analytics workspace (search for "Log Analytics workspaces" in the portal search bar). Removing the VM from the workspace immediately uninstalls OMS. There is no need to stop walinuxagent.
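If you prefer to stop the agents by hand instead of removing the VM from the workspace, the following is roughly what that looks like; the paths and service names are assumptions that vary by distro and agent version, so verify them on your machine:

# Stop the OMS agent for Linux (it ships a service_control helper;
# the path varies by agent version - verify before running).
sudo /opt/microsoft/omsagent/bin/service_control stop
# Stop and disable the Azure guest agent so OMS is not reinstalled
# (the service is named walinuxagent on Ubuntu, waagent on RHEL/CentOS).
sudo systemctl stop walinuxagent
sudo systemctl disable walinuxagent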

Related

Cannot connect to my Compute Engine instance - IP is being blocked?

It appears my GCP Compute Engine service/instance/whatever-you-call-it is refusing connections from my machine at times. I was just trying to set up an SFTP connection through a desktop app and probably failed a password too many times.
But I don't have Fail2Ban installed, and I don't see any firewall rules in the GCP interface blocking my IP. During what I perceive as the block, I can't even ping the machine. As soon as I switch to my cellphone's hotspot, I can ping it again; in a ping I had running, replies resumed the moment I switched networks.
Does anyone know where I can look to control this setting and/or see what's being done here?
lastb output shows regular attempts to get into my machine, so I don't understand why something is being so harsh on me specifically while that level of spam flows through to the Linux level.
Found the answer: it's sshguard running on Linux.
In /var/log/auth.log:
Apr 19 01:43:05 x-x sshguard[696]: Blocking "-.-.-.-/32" for 122880 secs (3 attacks in 1 secs, after 11 abuses over 3268716 secs.)
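For anyone hitting the same block, the sketch below shows one way to inspect and undo it; the chain name, whitelist path, and IP are assumptions (sshguard's firewall backend and file locations vary by distro), so check them against your setup:

# List the addresses sshguard has blocked (the default iptables
# backend uses a chain named "sshguard").
sudo iptables -L sshguard -n
# Whitelist your own IP permanently (1.2.3.4 is a placeholder) and
# restart sshguard to pick up the change.
echo "1.2.3.4" | sudo tee -a /etc/sshguard/whitelist
sudo systemctl restart sshguard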

Instance is overutilized. Consider switching to the machine type: g1-small

I created a new f1-micro instance with Ubuntu 16.04. I haven't logged in yet, as I have not figured out how to create the SSH key pair. But after two days, the Dashboard now shows:
Instance "xxx" is overutilized. Consider switching to the machine type: g1-small
Why is this happening? Isn't an f1-micro similar to an EC2 t1.nano? I have a t1.nano running a Node.js web site (with nginx, pm2, etc.), and my CPU credit has been consistently at the maximum of 150 during this period, with only me as a test user.
I started the f1-micro to run the same Node application to see which is more cost-effective. The parameter that was cloudy to me was that unexplained "0.2 virtual CPU". Is 0.2 of a CPU virtually unusable? Would 0.5 (g1-small) be significantly better?
To address your connection problems, perhaps temporarily until you figure out manual key management, you might want to try SSH from the browser, which is possible from the Cloud Platform console, or use the gcloud CLI to assist you.
https://cloud.google.com/compute/docs/instances/connecting-to-instance
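For example, the gcloud CLI generates and pushes an SSH key for you on first use; the instance name and zone below are placeholders:

# Connect to the VM; gcloud creates and manages the key pair automatically.
gcloud compute ssh my-instance --zone us-central1-a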
Once you get access via the terminal I would run 'top' or 'ps'.
Example of using ps to find the top CPU users:
ps axww -o pid,stat,%cpu,time,command --sort=-%cpu | head -10
Example of running top to find the top memory users:
top -b -n 1 -o %MEM | head -20
Google Cloud also offers a monitoring product called Stackdriver, which can show you this information in the Cloud console, but it requires an agent to be running on your VM. See the getting started guide if this sounds like a good option for you.
https://cloud.google.com/monitoring/quickstart-lamp
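For reference, installing the monitoring agent has historically been a two-liner like the one below; Google has changed the procedure over time, so treat this as an assumption and prefer whatever the current quickstart says:

# Download and run Google's agent install script (historical method).
curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh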
Once you get access to the resource usage data, you should be able to determine whether 1) the VM isn't powerful enough to run your Node.js server, or 2) something else got started on the host unexpectedly and that's the source of your usage.

EC2 Instance is running very slow

I am running an EC2 instance on an Ubuntu Server machine. Tomcat and MySQL are installed, and I deployed a Java web application on it about a month ago. It ran with great performance for almost a month, but now my application is responding very slowly.
Also, a point to note: earlier, when I logged into my Ubuntu server through PuTTY, it was quick, but now it takes a long time even after I enter the Ubuntu password.
Is there any solution?
I would start by checking memory, CPU, and network availability to see whether one of them is the bottleneck.
Try the following commands (a combined snapshot sketch follows the list):
To check memory availability:
free -m
To check CPU usage:
top
To check network usage:
ntop
To check disk usage:
df -h
To check disk io operations:
iotop
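To capture all of the above in one go for later comparison, here is a minimal sketch; it assumes the tools listed are installed and that iotop can run under sudo without a prompt:

#!/usr/bin/env bash
# Append a timestamped resource snapshot to a log file, e.g. from cron
# every 5 minutes, so slow periods can be compared against normal ones.
{
  echo "=== $(date -u +%FT%TZ) ==="
  free -m                          # memory availability
  top -b -n 1 | head -15           # top CPU consumers
  df -h                            # disk usage
  sudo iotop -b -n 1 | head -15    # disk I/O (needs root)
} >> /var/tmp/resource_snapshots.log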
Please also check whether you can log in to the machine quickly when your application is disabled. If login is still slow, then you should contact AWS support, complain about the poor performance, and ask about assigning more resources to that machine.
You can use the WAIT tool to diagnose what is wrong with your server or your application. The tool will gather information about CPU and memory utilization, running threads, etc.
In addition, I would definitely check the Tomcat application server with VisualVM or some other profiler. For configuring JMX for Tomcat, you can check the article here.
For network monitoring, the nload tool is worth your attention. You can launch it in a screen session so you can always check network utilization stats when the server is slow.
First, check whether any application is using too much CPU or memory; this can be done with the top command. Two simple shortcut keys may be helpful while using top: in the results page, pressing M sorts applications by memory usage, from highest to lowest, and pressing P sorts them by CPU usage, from highest to lowest.
If you are unable to find any suspicious application using top, you can use iotop, which shows disk I/O usage details.
I was facing the same issue; the solution which worked for me was:
Restart the EC2 instance.
Edit:
Lately I figured out that this issue happens when too few resources (memory, CPU) are available to the EC2 machine, so check the resources available to your instance.
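For reference, the restart and a quick resource check can both be done from the AWS CLI; the instance ID below is a placeholder:

# Reboot the instance in place (placeholder instance ID).
aws ec2 reboot-instances --instance-ids i-0123456789abcdef0
# Check the instance type to see what CPU/memory the machine gets.
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].InstanceType'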

Intermittent connection to SQL server database. “The login is from an untrusted domain and cannot be used with Windows authentication”

I am having some random intermittent issues connecting to a database running on a SQL Server 2008 instance that is joined to an Active Directory 2003 domain. It has only suddenly started doing this; all the workstations are Windows 7 Professional 32-bit, and the AD domain controller runs Server 2003.
There is no definite, reproducible fault; it's totally intermittent, and if the workstation is rebooted a few times it will then connect OK with no issues. It's not just one workstation either: it happens randomly to other workstations sometimes. There is no apparent fault in the network setup; they are running on gigabit network connections via a Cisco gigabit switch, and all other Windows features and network drives have no issues.
The SQL Server is running on very new 64-bit hardware from Dell, so it's not a hardware issue. It has been running like this OK for some time, and this random connection issue has only recently started to happen. Could it be the size of the database? The .mdf file has grown to 32 GB, and the log file is a whopping 135 GB. The database was migrated from a SQL Server 2000 server about 12 months ago.
Any help would be appreciated
Look at https://blogs.msdn.microsoft.com/sql_protocols/2008/05/02/understanding-the-error-message-login-failed-for-user-the-user-is-not-associated-with-a-trusted-sql-server-connection/
The basic take-away for this ticket is:
If this error message only appears sporadically in an application using Windows Authentication, it may result because the SQL Server cannot contact the Domain Controller to validate the user. This may be caused by high network load stressing the hardware, or by a faulty piece of networking equipment. The next step here is to troubleshoot the network hardware between the SQL Server and the Domain Controller by taking network traces and replacing network hardware as necessary.
The fact that your log file has grown so dramatically could be some cause for concern. You may want to back up more often, i.e. take log and differential (DIFF) backups, to reduce your high log requirement (and likely increase speed). Confirm that your daily maintenance tasks (index and statistics) are running.
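As a minimal sketch of the kind of log backup meant here (run from a shell with sqlcmd; the server, database name, and path are placeholders):

# Take a transaction log backup so log space can be reused and the
# log file stops growing (placeholder names throughout).
sqlcmd -S myserver -E -Q "BACKUP LOG [MyDatabase] TO DISK = N'D:\Backups\MyDatabase.trn'"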
In my case, a high write rate (150k writes and about 500k reads per hour) had some bearing on my issues, and reducing the hammering of the database reduced the occasions on which SQL could not reach the AD server in time.
You could also use a SQL user (to avoid the AD lookup), but getting to the bottom of the issue is probably more in your interest anyway.
It is very hard to tell what may be causing the issue, so I guess you need to do some further investigation (it could be your DNS, AD, etc.).
SQL Server 2008 has a feature for investigating connection issues called the connectivity ring buffer.
More info can be found here
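The ring buffer can be read with a DMV query along these lines (the server name is a placeholder):

# Dump recent connectivity ring buffer entries, which record why
# connections failed or were closed.
sqlcmd -S myserver -E -Q "SELECT CAST(record AS XML) FROM sys.dm_os_ring_buffers WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY';"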

MySQL Cluster - Excessive network traffic

I am currently in the process of testing a brand new MySQL Cluster on a few VirtualBox VMs. I have successfully configured everything and have 1 management node, 2 data nodes and 3 application nodes working perfectly as far as data consistency is concerned.
The issue is that there appears to be quite excessive network traffic between the ndbd and ndb_mgmd processes on each machine. Memory usage also seems quite high, although this is not as much of an issue. My cluster is not doing anything, and yet quite a few Kb/s are being transferred between the nodes.
Is this normal? And if not, what have I done wrong?
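Some constant heartbeat chatter between the node types is expected even on an idle cluster; to put numbers on it before judging whether it is abnormal, per-process bandwidth can be measured with something like nethogs (the interface name is a placeholder):

# Attribute live bandwidth to individual processes (ndbd, ndb_mgmd,
# mysqld) to confirm which ones generate the idle traffic; needs root.
sudo nethogs eth0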