lost connection mysql in C - mysql

I've written a C program thats running multiple threads and uses MySQL. After some testing i repeatedly saw the error (with hours between) "Mysql server gone away", so i maximized the wait_timeout setting of mysql. But now i get the error "Lost connection to MySQL server during query". These errors only occured when i run the program on a multiple core processor.
Maybe you guys know whats wrong or what i have to do to run my threaded program?

If you've got a multithreaded program that behaves differently on a 1-core system and a multicore system (works on 1-core and has bugs on multicore), it's written incorrectly: that's a sure sign of a race condition. It means the code is actually incorrect, and if scheduled just wrong will trample on its own data, and this is actually happening in practice on the multicore system and not on the 1-core system.
Actually, the same problem could happen on the 1-core system too, it's just less likely and more rare because the threads can't be scheduled truly simultaneously, so one thread has to preempt the other at just the wrong time, for you to see the buggy behavior. This is why if you're writing multithreaded code, you should always test and debug it on a multicore host. You're much more likely to actually see the evidence of race conditions; running on a 1-core host they can remain hidden for much longer.
I don't know what libraries you're using, but they don't look thread-safe or you're not using them in a thread-safe fashion.

Related

Couchbase server: Slow GET operation on connection

I am using couchbase-server community edition version 4.0.0. I find the below logs in the babysitter.log file repeatedly. The frequency is higher when the cluster is under load.
WARNING 473: Slow GET operation on connection (10.3.4.14:55846 => 10.3.9.13:11210): 4325 ms
WARNING 1057: Slow DELETE operation on connection (10.3.2.23:46152 => 10.3.9.13:11210): 1280 ms
I find little to no documentation for this warning log. What do these logs mean? How can I further debug the cause for slow operation?
At some point, we added some logging to Couchbase to help identify bigger problems that come about from resource exhaustion by correlating to smaller problems that happen earlier. These are some of those warnings.
They are generally safe to ignore. It does indicate that your system might be under heavy load and perhaps the process isn't getting enough time to send/receive/handle results.
I'd also say the 4.0 version is quite old at this stage and Couchbase has been much improved and is shipping much newer runtimes. In particular, this is coming from an erlang process, one of the things we've updated the runtime on. I expect you'll see less of this and perhaps a little more detail.

ASP .Net 2, Classic Pipeline on IIS 8 64 bit scalability issues

Apologies for the fairly generic nature of the question - I'm simply hoping someone can contribute some suggestions and/or ideas as I'm out of both!
The background:
We run a fairly large (35M hits/month, peak around 170 connections/sec) site which offers free software downloads (stricly legal) and which is written in ASP .NET 2 (VB .Net :( ). We have 2 web servers, sat behind a dedicated hardware load balancer and both servers are fairly chunky machines, Windows Server 2012 Pro 64 bit and IIS 8. We serve extensionless URLs by using a custom 404 page which parses out the requested URL and Server.Transfers appropriately. Because of this particular component, we have to run in classic pipeline mode.
DB wise we use MySQL, and have two replicated DBs, reads are mainly done from the slave. DB access is via a DevArt library and is extensively cached.
The Problem:
We recently (past few months) moved from older servers, running Windows 2003 Server and IIS6. In the process, we also upgraded the Devart Component and MySql (5.1). Since then, we have suffered intermitted scalability issues, which have become significantly worse as we have added more content. We recently increased the number of programs from 2000 to 4000, and this caused response times to increase from <300ms to over 3000ms (measured with NewRelic). This to my mind points to either a bottleneck in the DB (relatively unlikely, given the extensive caching and from DB monitoring) or a badly written query or code problem.
We also regularly see spikes which seem to coincide with cache refreshes which could support the badly written query argument - unfortunately all caching is done for x minutes from retrieval so it can't always be pinpointed accurately.
All our caching uses locks (like this What is the best way to lock cache in asp.net?), so it could be that one specific operation is taking a while and backing up requests behind it.
The problem is... I can't find it!! Can anyone suggest from experience some tools or methods? I've tried to load test, I've profiled the code, I've read through it line by line... NewRelic Pro was doing a good job for us, but the trial expired and for political reasons we haven't purchased a full licence yet. Maybe WinDbg is the way forward?
Looking forward to any insight anyone can add :)
It is not a good idea to guess on a solution. Things could get painful or expensive quickly. You really should start with some standard/common triage techniques and make an educated decision.
Standard process for troubleshooting performance problems on a data driven app go like this:
Review DB indexes (unlikely) and tune as needed.
Check resource utilization: CPU, RAM. If your CPU is maxed-out, then consider adding/upgrading CPU or optimize code or split your tiers. If your RAM is maxed-out, then consider adding RAM or split your tiers. I realize that you just bought new hardware, but you also changed OS and IIS. So, all bets are off. Take the 10 minutes to confirm that you have enough CPU and RAM, so you can confidently eliminate those from the list.
Check HDD usage: if your queue length goes above 1 very often (more than once per 10 seconds), upgrade disk bandwidth or scale-out your disk (RAID, multiple MDF/LDFs, DB partitioning). Check this on each MySql box.
Check network bandwidth (very unlikely, but check it anyway)
Code: a) Consider upgrading to .net 3.5 (or above). It was designed for better scalability and has much better options for caching. b) Use newer/improved caching. c) pick through the code for harsh queries and DB usage. I have had really good experiences with RedGate Ants, but equiv. products work good too.
And then things get more specific to your architecture, code and platform.
There are also some locking mechanisms for the Application variable, but they are rarely the cause of lockups.
You might want to keep an eye on your pool recycle statistics. If you have a memory leak (or connection leak, etc) IIS might seem to freeze when the pool tops-out and restarts.

Unexpected CPU spikes

Im working with SQL Server 2008 R2. In development environment time to time i can see that CPU is under load for couple of minutes (around 55%-80% while normal is 1-2%. yes really- normal load in my development environment is almost none!). During this CPU pressure time automated tests sometimes gets timeout errors.
Just experienced timeout in activity monitor. looks like this:
Tipicaly during these pressure moments its looks like this:
Problem is that i cant understand why it is happening! There is continuously executed automated tests, but they are not making heavy workload. During performance tests system works good and if it slows down there is always good explanation for that.
Im trying to resolve issue by
Running trace, but during those CPU spikes there is "nothing special" going on. No expensive queries.
Using SQL Activity Monitor- everything seems normal, except CPU (just like 1-2 waiting tasks, low I/O, ~5 requests/sec). Recent expensive queries are not that expensive.
Querying data. Using famous sp_WhoIsActive and sys.dm_exec_requests. As far i understand- nothing unusual again..
About my server
There is small number of databases and i do know them good.
Using Service Broker.
Trace is running most of the time.
I do suspect that it is some background process that is making problems. But i dont really get it.. Can you please give some hints/ideas how to resolve this?
It could be some internal SQL Server job. Like a big index rebuild.
Wait for the spike and run sp_who2 'active'. Check the column CPU Time.
Actually, how are you 100% sure that SQL is the responsible? Couldn't it be a SO issue?
I have faced the same issues, and have raised case with Microsoft.
Microsoft guy told there is no issue from SQL DB side , if cpu is spiked. Finally issue is resolved by Microsoft , actually issue was that on IIS, not SQL Server.
Every 29 days IIS need to be restart, So that you will get better performance on Application.

What causes mysterious hanging threads in Colfusion -> mysql communication

One of the more interesting "features" in Coldfusion is how it handles external requests. The basic gist of it is that when a query is made to an external source through <cfquery> or or any other external request like that it passes the external request on to a specific driver and at that point CF itself is unable to suspend it. Even if a timeout is specified on the query or in the cfsetting it is flatly ignored for all external requests.
http://www.coldfusionmuse.com/index.cfm/2009/6/9/killing.threads
So with that in mind the issue we've run into is that somehow the communication between our CF server and our mySQL server sometimes goes awry and leaves behind hung threads. They have the following characteristics.
The hung thread shows up in CF and cannot be killed from FusionReactor.
There is no hung thread visible in mySQL, and no active running query (just the usual sleeps).
The database is responding to other calls and appears to be operating correctly.
Max connections have not been reached for the DB nor the user.
It seems to me the only likely candidate is that somehow CF is making a request, mySQL is responding to that request but with an answer which CF ignores and continues to keep the thread open waiting for a response from mySQL. That would explain why the database seems to show no signs of problems, but CF keeps a thread open waiting for the mysterious answer.
Usually these hung threads appear randomly on otherwise working scripts (such as posting a comment on a news article). Even while one thread is hung for that script, other requests for that script will go through, which would imply that the script isn't neccessarily at fault, but rather the condition faced when the script was executed.
We ran some test to determine that it was not a mysql generated max_connections error... we created a user, gave it 1 max connections, tied that connection with a sleep(1000) query and executed another query. Unfortunately, it correctly errored out without generating a hung thread.
So, I'm left at this point with absolutely no clue what is going wrong. Is there some other connection limit or timeout which could be causing the communication between the servers to go awry?
One of the things you should start to look at is the hardware between the two servers. It is possible that you have a router or bridge or NIC that is dropping occasional packets. This can result in the mySQL box thinking it has completed the task while the CF server sits there and waits for a complete response indefinitely, creating a hung thread.
3com has some details on testing for packet loss here: http://support.3com.com/infodeli/tools/netmgt/tncsunix/product/091500/c11ploss.htm#22128
We had a similar problem with a MS SQL server. There, the root cause was a known issue in which, for some reason, the server thinks it's shutting down, and the thread hangs (even though the server is, obviously, not shutting down).
We weren't able to eliminate the problem, but were able to reduce it by turning off pooled DB connections and fiddling with the connection refresh rate. (I think I got that label right -- no access to administrator at my new employment.) Both are in the connection properties in Administrator.
Just a note: The problem isn't entirely with CF. The problem, apparently, affects all Java apps. Which does not, in any way, reduce how annoyed I get by this.
Long story short, but I believe the caused was due to Coldfusion's CF8 image processing. It was just buggy and now in CF9 I have never seen that problem again.

Fatal errors in live servers

I'm writing some client/server software and I'm facing the following design issue. Normally, I use a VERIFY macro very liberally - if something is wrong in an user's machine, I want the software to fail and log the error so it can be fixed. I was never a fan of ignoring any kind of errors.
However, I'm now writing a server. If the server dies, many clients go down, so the server should die as little as possible. Therefore, I don't know how to treat some conditions that I'd treat as fatal exceptions otherwise.
For example, I get a network packet from an user who isn't logged in. Even though it shouldn't happen, I have enough experience to know "impossible" errors do happen from time to time. So I'm pretty sure if I do a fatal error on these cases, the server WILL crash eventually. On the other hand, I could log and ignore the error and continue, but I'm afraid some bugs may go undetected this way.
What would you do in a situation like this one?
If you can recover from the error, than obviously it wasn't fatal. I can't see the benefit of failing if you can log the error and continue execution - the most important thing is that you've captured the error on log. If you can recover and continue to operate as normal, than that is the best course.
You should implement in addition a notification system (server monitoring) that depending on the error level would notify you in varying degrees of urgency so you'd pick up as soon as possible on something time critical. There are generic system like that for servers, such as Nagios and Munin. You should have look at what they do and see if you can take something from them and implement / integrate it into your system.
Regardless, you should try to make sure client instances are as sandboxed as possible. A client thread going down shouldn't take down the entire server - ever (at least in theory).