I have a server hosting a website of mine that has almost zero traffic.
A few people (< 20) enter the site every day, and a few RSS readers are subscribed to some feeds we put out.
Almost every night, an RSS reader will hit us in the middle of the night and get an exception that the website can't connect to the SQL Server because of a Timeout in the connection.
The details are extremely weird, so I'm looking for some help on what could be the issue, since I don't know where to start looking anymore.
We are using ASP.NET MVC, Entity Framework, and SQL Server 2008 on Windows Server 2008. The machine is a dedicated box we got from a not exactly top-tier provider, so things might be configured non-optimally, or who knows what else.
The box is also pretty small, with only 1 GB of RAM, but it should handle the kind of load we have for now...
I'm copying the full call stack below, but first, some of the things we know:
The error always happens when iTunes is querying our site. I believe this should have nothing to do with anything, but the truth is that we only get it from iTunes. My best guess is that this happens because only iTunes queries us at that time of the night when no one else is hitting us.
One of our theories is that the SQL Server and IIS are fighting for memory, and one of them is getting paged to disk out of not being used, and when someone "wakes it up", it takes too long to read everything from disk back into memory. Is this something that could potentially happen? (I'm kind of discarding this since it sounds like a design issue in SQL Server if it were possible)
I also thought about the possibility that we're leaking connections, since we may not be disposing of EF entities appropriately (see my question here). This is the only thing I could find by Googling the problem. I'm discarding this given the extremely low load we have.
This always happens over the night, so it's very likely something related to the fact that nothing happened for a while. For example, I'm pretty sure that when these requests hit, the web server process got recycled and it's starting up / re-JITting everything. The re-JITting doesn't explain the SQL timeout, though.
UPDATE: We attached a profiler as suggested, and it took quite a while before we had a new exception. This is the new stuff we know:
Having the profiler attached enormously reduced the number of errors we got. In fact, after normally getting several per day, we had to wait for 3 or 4 days for this to happen ONCE. Once we stopped the profiler, it went back to the normal error frequency (or even worse). So the profiler has some effect that hides this problem to some extent, but not completely.
Looking at the profiler trace next to the IIS request log, there is an expected 1-1 correspondence between requests and queries. However, every now and then, I see A LOT of queries being executed that have no correlation at all with the IIS log. In fact, right before the actual bug was logged, I got 750 queries in a period of 3 minutes, all of which were completely unrelated to the IIS logs. The query text looks like the kind of unreadable crap that EF generates; the queries are not all the same, and they all look just like the queries coming from the website: same ApplicationName, User, etc. To give an idea of how ridiculous this is, the site got about 370 IIS requests that hit the DB over the course of 2 days.
These unexplained queries did not come from the same ClientProcessID as the previous website ones, although they may still have come from the website, if the process got recycled in the meantime. There was almost an hour of no activity between the last explained query, and the first unexplained one.
One of these long streaks of queries that I don't know where they came from came right before the error I got logged, so I believe this is the clue we should be following.
As I expected originally, when the query that threw the error was executed, it came from a different ClientProcessID than the previous one, (8 minutes later than the previous unexplained one, and almost exactly one hour later than the previous IIS one). This means, to me, that the worker process had indeed gotten recycled.
This is something I absolutely don't understand. The IIS log shows that one minute before the failing requests, 4 requests were served perfectly, although the queries for those don't show up in the trace at all. In fact, after those 4 that went well, I had 4 exceptions thrown in quick succession, and those 4 ALSO don't show up in the trace (which makes sense, since with a connection timeout the query should never have been executed, but I don't see the connection attempts in the trace either).
So, in short, I'm completely clueless about this. I can't find a reason for those hundreds of queries that get run in quick succession, but I believe those must have something to do with the problem.
I also don't know how to diagnose the connection problems...
Or how the Profiler trace may be missing some queries that according to IIS went through fine...
Any ideas?
This is the exception information:
System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
System.Data.EntityException: The underlying provider failed on Open. ---> System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
at System.Data.ProviderBase.DbConnectionPool.GetConnection(DbConnection owningObject)
at System.Data.ProviderBase.DbConnectionFactory.GetConnection(DbConnection owningConnection)
at System.Data.ProviderBase.DbConnectionClosed.OpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory)
at System.Data.ProviderBase.DbConnectionClosed.OpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory)
at System.Data.SqlClient.SqlConnection.Open()
at System.Data.EntityClient.EntityConnection.OpenStoreConnectionIf(Boolean openCondition, DbConnection storeConnectionToOpen, DbConnection originalConnection, String exceptionCode, String attemptedOperation, Boolean& closeStoreConnectionOnFailure)
at System.Data.EntityClient.EntityConnection.OpenStoreConnectionIf(Boolean openCondition, DbConnection storeConnectionToOpen, DbConnection originalConnection, String exceptionCode, String attemptedOperation, Boolean& closeStoreConnectionOnFailure)
--- End of inner exception stack trace ---
at System.Data.EntityClient.EntityConnection.OpenStoreConnectionIf(Boolean openCondition, DbConnection storeConnectionToOpen, DbConnection originalConnection, String exceptionCode, String attemptedOperation, Boolean& closeStoreConnectionOnFailure)
at System.Data.EntityClient.EntityConnection.Open()
at System.Data.Objects.ObjectContext.EnsureConnection()
at System.Data.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
at System.Data.Objects.ObjectQuery`1.System.Collections.Generic.IEnumerable<T>.GetEnumerator()
at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
at System.Data.Objects.ELinq.ObjectQueryProvider.<GetElementFunction>b__1[TResult](IEnumerable`1 sequence)
at System.Data.Objects.ELinq.ObjectQueryProvider.ExecuteSingle[TResult](IEnumerable`1 query, Expression queryRoot)
at System.Data.Objects.ELinq.ObjectQueryProvider.System.Linq.IQueryProvider.Execute[S](Expression expression)
at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source)
at MyProject.Controllers.SitesController.Feed(Int32 id) in C:\...\Controller.cs:line 38
at lambda_method(ExecutionScope , ControllerBase , Object[] )
at System.Web.Mvc.ReflectedActionDescriptor.Execute(ControllerContext controllerContext, IDictionary`2 parameters)
at System.Web.Mvc.ControllerActionInvoker.InvokeActionMethod(ControllerContext controllerContext, ActionDescriptor actionDescriptor, IDictionary`2 parameters)
at System.Web.Mvc.ControllerActionInvoker.<>c__DisplayClassa.<InvokeActionMethodWithFilters>b__7()
at System.Web.Mvc.ControllerActionInvoker.InvokeActionMethodFilter(IActionFilter filter, ActionExecutingContext preContext, Func`1 continuation)
at System.Web.Mvc.ControllerActionInvoker.InvokeActionMethodWithFilters(ControllerContext controllerContext, IList`1 filters, ActionDescriptor actionDescriptor, IDictionary`2 parameters)
at System.Web.Mvc.ControllerActionInvoker.InvokeAction(ControllerContext controllerContext, String actionName)
at System.Web.Mvc.Controller.ExecuteCore()
at System.Web.Mvc.MvcHandler.ProcessRequest(HttpContextBase httpContext)
at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
Any ideas will be enormously appreciated.
Not Enough Memory
This is very likely a memory problem, perhaps aggravated or triggered by other things, but still inherently a memory problem. There are two other (less likely) possibilities that you should check and eliminate first (because it is easy to do so):
Easy To Check Possibilities:
You may have "Auto Close" enabled: Auto Close can have exactly this behavior, however it is rare for it to be turned on. To check this, in SSMS right-click on your application database, select "Properties", and then select the "Options" pane. Look at the "Auto Close" entry and make sure that it is set to False. Check tempdb also.
SQL Agent Jobs may be causing it: Check the Agent's History Log to see if there were any jobs consistently running during the events. Remember to check maintenance jobs too, as things like Rebuilding Indexes are frequently cited as performance problems while they are running. These are unlikely candidates now, only because they would not normally be affected by the Profiler.
Why It Looks Like a Memory Problem:
If those do not show anything, then you should check for memory problems. I suspect Memory as the cause in your case because:
You have 1 GB of Memory: Although this is technically above the Minimum for SQL Server, it is way below the recommended for SQL Server, and way below what in my experience is acceptable for production, even for a lightly loaded server.
You are running IIS and SQL Server on the same box: This is not recommended by itself, in large part because of the contention for memory that results, but with only 1 GB of memory it results in IIS, the app, SQL Server, the OS and any other tasks and/or maintenance all fighting for very little memory. The way Windows manages this is to give memory to the active processes by aggressively taking it away from the non-active processes. It can take many seconds, or even minutes, for a large process like SQL Server to get back enough of its memory to be able to completely service a request in this situation.
Profiler made 90% of the problem go away: This is a big clue that memory is likely the problem, because typically, things like Profiler have exactly this effect on this particular problem: the Profiler task keeps SQL Server just a little bit active all of the time. Frequently, this is just enough activity to either keep it off the OS's "scavenger" list, or at least reduce its impact somewhat.
How to Check For Memory as the Culprit:
Turn Off the Profiler: It's having a Heisenberg effect on the problem, so you have to turn it off or you will not be able to see the problem reliably.
Run a System Monitor (perfmon.exe) from another box that remotely connects to the performance collection service on the box that your SQL Server and IIS are running on. You can most easily do this by first removing the three default stats (they are local only), and then adding in the needed stats (below), but make sure to change the Computer name in the first drop-down to connect to your SQL box.
Send the collected data to a file by creating a "Counter Log" on perfmon. If you are unfamiliar with this, then the easiest thing to do is probably to collect the data to a tab or comma separated file that you can open with Excel to analyze.
Set up your perfmon to collect to a file and add the following counters to it:
-- Processor\% Processor Time [_Total]
-- PhysicalDisk\% Idle Time [for each disk]
-- PhysicalDisk\Avg. Disk Queue Length [for each disk]
-- Memory\Pages/sec
-- Memory\Page Reads/sec
-- Memory\Available MBytes
-- Network Interface\Bytes Total/sec [for each interface in use]
-- Process\% Processor Time [see below]
-- Process\Page Faults/sec [see below]
-- Process\Working Set [see below]
For the Process counters (above) you want to include the SQL Server process (sqlservr.exe), any IIS processes, and any stable application processes. Note that this will ONLY work for "stable" processes. Processes that are continually being re-created as needed cannot be captured this way, because there is no way to specify them before they exist.
Run this collection to a file during the time that the problem most frequently happens. Set the collection interval to something close to 10-15 secs. (this collects a lot of data, but you will need this resolution to pick out the separate events).
After you have one or more incidents, stop the collection and then open your collected data file with Excel. You will probably have to reformat the timestamp column to be usefully visible and show hours, minutes and seconds. Use your IIS log to find the exact time of the incidents, then look at the perfmon data to see what was going on before and after the incident. In particular, you want to see if the SQL Server process's working set was small before and large after, with a lot of page faulting in between. That's the clearest sign of this problem.
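If Excel gets tedious, the same before/after comparison can be scripted. The following is only a sketch: it assumes the counter log was saved as CSV with the timestamp in the first column, and the machine name `BOX`, the column name, the timestamp format and all sample values are made up for illustration, so adjust them to whatever your file actually contains.

```python
import csv
import io
from datetime import datetime, timedelta


def working_set_around(csv_text, column, incident, window_s=60,
                       ts_format="%m/%d/%Y %H:%M:%S.%f"):
    """Split one counter column into samples just before and just after an incident.

    Returns (before, after): two lists of floats taken from `column`,
    covering `window_s` seconds on each side of `incident`.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    col = header.index(column)
    window = timedelta(seconds=window_s)
    before, after = [], []
    for row in reader:
        # Skip blank rows and samples where the counter had no value.
        if not row or not row[col].strip():
            continue
        ts = datetime.strptime(row[0], ts_format)
        value = float(row[col])
        if incident - window <= ts < incident:
            before.append(value)
        elif incident <= ts <= incident + window:
            after.append(value)
    return before, after


# Made-up sample data (assumption): working set in bytes, sampled every 15 s.
COLUMN = r"\\BOX\Process(sqlservr)\Working Set"
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0)","\\BOX\Process(sqlservr)\Working Set"',
    r'"03/15/2010 02:13:15.000","20971520"',
    r'"03/15/2010 02:13:30.000","20971520"',
    r'"03/15/2010 02:13:45.000","314572800"',
    r'"03/15/2010 02:14:00.000","419430400"',
])

# Incident time taken from the IIS log (hypothetical value here).
incident = datetime(2010, 3, 15, 2, 13, 40)
before, after = working_set_around(SAMPLE, COLUMN, incident, window_s=30)
print("max before:", max(before), "max after:", max(after))
```

A working set that is tiny before the incident and many times larger right after it, together with a spike in Memory\Pages/sec over the same interval, is the pattern described above.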
SOLUTIONS:
Either separate IIS and SQL Server onto two different boxes (preferred) or else add more memory to the box. I would think that 3-4 GB should be a minimum.
What About That Weird EF Stuff?
The problem here is that it is most likely either peripheral or only contributory to your main problem. Remember that Profiler made 90% of your incidents go away, so what remains may be a different problem, or it may be only the most extreme aggravator of the problem. Because of its behavior, I would guess that EF is either cycling its cache or there is some other background maintenance of the application server processes.
I would compare the timestamp of the timeout with the execution time of your nightly backup. If they coincide, you could set your RSS feed to be static for that time.
Another thing to try (even though it isn't exactly an answer) is to immediately run sp_who when you get a timeout exception. It won't catch everything (the offending process could be done by the time you run this) but you may get lucky.
You can also fire up SQL Profiler when you head home for the night and step through the activity the next morning if you see the error again. Just be sure to not run it from the server itself (I'm pretty sure it reminds you of this when it starts).
EDIT: Addressing your update.
Is EF updating/creating its cache? It could explain the abundance of queries at one time and why no queries had database hits later.
Other than that, it appears you have a heisenbug. The only thing I can think for you to add is a lot more logging (to a file or the event log).
It smells like a scheduled (cron-style) job that runs at the same time every night.
As RBarryYoung says, it could be some nightly backup, or it could be something else.
Do you have root access to the server?
Can you see the crontabs?
Could it be some full text indexing plugin on top of the SQL server that runs its reindexing procedures close to the time you are experiencing the issues?
In my case, the problem went away after I installed SQL Server 2008 R2 SP3.
Server: Windows 7 + SQL Server 2008 R2 (Developer Edition)
Client: Raspberry Pi 3B+, ASP.NET Core + EF Core
I wrote a web application using Python and the Flask framework, and set it up on Apache with mod_wsgi.
Today I use JMeter to perform some load testing on this application.
For one web URL:
when I set only 1 thread to send request, the response time is 200ms
when I set 20 concurrent threads to send requests, the response time increases to more than 4000ms(4s). THIS IS UNACCEPTABLE!
I am trying to find the problem, so I recorded the time in before_request and teardown_request methods of flask. And it turns out the time taken to process the request is just over 10ms.
In this URL handler, the app just performs some SQL queries (about 10) against a MySQL database, nothing special.
To test if the problem is with web server or framework configuration, I wrote another method Hello in the same flask application, which just returns a string. It performs perfectly under load, the response time is 13ms with 20-thread concurrency.
And when doing the load test, I execute 'top' on my server, there are about 10 apache threads, but the CPU is mostly idle.
I am at my wit's end now. Even if the requests are performed serially, the performance should not drop so drastically... My guess is that there is some queuing somewhere that I am unaware of, and there must be overhead besides handling the request.
If you have experience in tuning performance of web applications, please help!
EDIT
About apache configuration, I used MPM worker mode, the configuration:
<IfModule mpm_worker_module>
StartServers 4
MinSpareThreads 25
MaxSpareThreads 75
ThreadLimit 64
ThreadsPerChild 50
MaxClients 200
MaxRequestsPerChild 0
</IfModule>
As for mod_wsgi, I tried turning WSGIDaemonProcess on and off (by commenting the following line out), the performance looks the same.
# WSGIDaemonProcess tqt processes=3 threads=15 display-name=TQTSERVER
Congratulations! You found the performance problem - not your users!
Analysing performance problems on web applications is usually hard, because there are so many moving parts, and it's hard to see inside the application while it's running.
The behaviour you describe is usually associated with a bottleneck resource - this happens when there's a particular resource that can't keep up, so queues requests, which tends to lead to a "hockey stick" curve with response times - once you hit the point where this resource can't keep up, the response time goes up very quickly.
20 concurrent threads seems low for that to happen, unless you're doing a lot of very heavy lifting on the page.
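A quick back-of-the-envelope check shows what a fully serialized bottleneck would look like in a load test like this one. The sketch below assumes a closed-loop test (each JMeter thread waits for its response before sending the next request, with no think time) and a fixed service time; the function name and the model are illustrative assumptions, not a measurement of the real system.

```python
def closed_loop_response_ms(clients, service_ms, parallel_workers=1):
    """Estimate steady-state response time when `clients` threads hammer a
    resource that can serve `parallel_workers` requests at once.

    With no think time, throughput is capped at parallel_workers / service_ms,
    so by Little's law each request takes about clients * service_ms / workers.
    """
    if clients <= parallel_workers:
        # No queueing: every request gets a worker immediately.
        return service_ms
    return clients * service_ms / parallel_workers


print(closed_loop_response_ms(1, 200))       # one thread, no queueing
print(closed_loop_response_ms(20, 200))      # 20 threads, one request at a time
print(closed_loop_response_ms(20, 200, 4))   # 20 threads, 4 requests in parallel
```

With the numbers from the question (200 ms for a single thread, 20 threads), the fully serialized case comes out at about 4000 ms, which is close to what was observed. That would be consistent with something on the path handling this URL one request at a time, though the arithmetic alone cannot say where.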
The first place to start is top: while CPU is low, what are memory, disk access, etc. doing? Is your database running on the same machine? If not, what does top say on the database server?
Assuming it's not some silly hardware thing, the next most likely problem is the database access on that page. It may be that one query is returning literally the entire database when all you want is one record (this is a fairly common anti-pattern with ORM solutions); that could lead to the behaviour you describe. I would use the Flask logging framework to record your database calls (start, end, number of records returned), and look for anomalies there.
If the database is performing well under load, it's either the framework or the application code. Again, use logging statements in the code to trace the execution time of individual blocks of code, and keep hunting...
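As a minimal sketch of that kind of tracing, a small context manager built on the standard library can time any block and log the result. The `timed` helper, the logger name and the labels are hypothetical; in a Flask app you would likely log through the application's own logger instead.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("timing")


@contextmanager
def timed(label, sink=None):
    """Log how long the wrapped block took; optionally record it in `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("%s took %.1f ms", label, elapsed_ms)
        if sink is not None:
            sink.append((label, elapsed_ms))


timings = []

# Stand-in for a database call or any other block you suspect.
with timed("load_items_query", sink=timings):
    time.sleep(0.01)
```

Wrapping each query and each larger step of the handler this way makes it easy to see which block's time grows as concurrency goes up.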
It's not glamorous, and can be really tedious - but it's a lot better that you found this before going live!
Look at using New Relic to identify where the bottleneck is. See overview of it and discussion of identifying bottlenecks in my talk:
http://lanyrd.com/2012/pycon/spcdg/
Also edit your original question and add the mod_wsgi configuration you are using, plus whether you are using Apache prefork or worker MPM as you could be doing something non optimal there.
This question concerns the use of Connector/NET, MySQL, and the Visual Studio integration of the typed data set designer. As a beginner I have almost entirely default settings, and am using InnoDB for tables.
It appears that on certain tables the designer slows to a grind on adding new queries to the adapter for me. Looking in the Workbench administrator it appears to be running through a loop of similar queries fetching the schema of the tables, checking permissions, deleting tmp tables, etc. I presume it is trying to validate the syntax of the query that I'm adding and get details to implement the XSD spec.
I am not sure how to proceed in debugging this further. The slowdown becomes essentially equivalent to a total freeze. If I had to take a total guess, I'd say there's looping going on rather than a slow sequence of events: a timeout exception gets swallowed, and a loop repeats. If you shut down the service, the logon credentials dialog will pop up in the designer, but you can't cancel out of it without shutting down the VS process, which seems sloppy.
Any ideas? Is this functionality being used by other people or am I in no man's land?
I'm doing my first foray with MySQL and I have a doubt about how to handle the connection(s) my application has.
What I am doing now is opening a connection and keeping it alive until I terminate my program. I do a mysql_ping() every now and then and the connection is started with MYSQL_OPT_RECONNECT.
The other option (I can think of), would be to start a new connection before doing anything that requires my connection to the database and closing it after I'm done with it.
What are the pros and cons of these two approaches?
What are the "side effects" of a long connection?
What is the most used method of handling this?
Cheers ;)
Some extra details
At this point I am keeping the connection alive and I ping it every now and again to know its status and reconnect if needed.
In spite of this, when there is some consistent concurrency with queries happening in quick succession, I get a "Server has gone away" message and after a while the connection is re-established.
I'm left wondering if this is a side effect of a prolonged connection or if this is just a case of bad mysql server configuration.
Any ideas?
In general there is quite some amount of overhead incurred when opening a connection. Depending on how often you expect this to happen it might be ok, but if you are writing any kind of application that executes more than just a very few commands per program run, I would recommend a connection pool (for server type apps) or at least a single or very few connections from your standalone app to be kept open for some time and reused for multiple transactions.
That way you have better control over how many connections get opened at the application level, even before the database server gets involved. This is a service an application server offers you, but it can also be rolled up rather easily if you want to keep it smaller.
Apart from performance reasons a pool is also a good idea to be prepared for peaks in demand. When a lot of requests come in and each of them tries to open a separate connection to the database - or as you suggested even more (per transaction) - you are quickly going to run out of resources. Keep in mind that every connection consumes memory inside MySQL!
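To make the idea concrete, here is a minimal hand-rolled pool. It is only a sketch under loud assumptions: `ConnectionPool` is a hypothetical class, and `sqlite3` in-memory connections stand in for MySQL connections so the example runs anywhere; a real application would pass a factory that opens MySQL connections and would also need to handle dead connections.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and then reused."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout,
        # so a burst of requests waits here instead of opening new connections.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


# sqlite3 is a stand-in (assumption) for the real database driver.
pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)

with pool.connection() as conn:
    result = conn.execute("SELECT 1 + 1").fetchone()[0]
print(result)
```

The pool size is the hard cap on how many connections the application can ever hold open, which is what keeps a peak in demand from exhausting the database server.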
Also you want to make sure to use a non-root user to connect, because if you don't (I think it is tied to the MySQL SUPER privilege), you might find yourself locked out. MySQL reserves at least one connection for an administrator for problem fixing, but if your app connects with that privilege, all connections would already be used up when you try to put out the fire manually.
Unless you are worried about having too many connections open (i.e. over 1,000), you should leave the connection open. There is overhead in connecting/reconnecting that will only slow things down. If you know you are going to need the connection to stay open for a while, run this query instead of pinging periodically:
SET SESSION wait_timeout=#
Where # is the number of seconds to leave an idle connection open.
What kind of application are you writing? If it's a web script: keep it open. If it's an executable, pool your connections (if necessary; most of the time a singleton will do).
I've written a C program that runs multiple threads and uses MySQL. After some testing I repeatedly saw the error "Mysql server gone away" (with hours between occurrences), so I maximized the wait_timeout setting of MySQL. But now I get the error "Lost connection to MySQL server during query". These errors only occurred when I ran the program on a multi-core processor.
Maybe you guys know what's wrong, or what I have to do to run my threaded program?
If you've got a multithreaded program that behaves differently on a 1-core system and a multicore system (works on 1-core and has bugs on multicore), it's written incorrectly: that's a sure sign of a race condition. It means the code is actually incorrect, and if scheduled just wrong will trample on its own data, and this is actually happening in practice on the multicore system and not on the 1-core system.
Actually, the same problem could happen on the 1-core system too, it's just less likely and more rare because the threads can't be scheduled truly simultaneously, so one thread has to preempt the other at just the wrong time, for you to see the buggy behavior. This is why if you're writing multithreaded code, you should always test and debug it on a multicore host. You're much more likely to actually see the evidence of race conditions; running on a 1-core host they can remain hidden for much longer.
I don't know what libraries you're using, but they don't look thread-safe or you're not using them in a thread-safe fashion.
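One common way to stay on the safe side is to never share a database connection between threads: give each thread its own, or guard a shared one with a mutex. The sketch below shows the per-thread variant in Python rather than C, purely to illustrate the pattern; `sqlite3` is a stand-in for the MySQL client library and `get_connection` is a hypothetical helper.

```python
import sqlite3
import threading

_local = threading.local()


def get_connection():
    """Return this thread's own connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
        _local.conn = conn
    return conn


results = []
results_lock = threading.Lock()


def worker(n):
    conn = get_connection()
    # A second lookup in the same thread reuses the same connection.
    assert get_connection() is conn
    try:
        value = conn.execute("SELECT ? * 2", (n,)).fetchone()[0]
        # The results list is shared between threads, so guard it with a lock.
        with results_lock:
            results.append(value)
    finally:
        conn.close()
        _local.conn = None


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

In the C program the equivalent would be one connection handle per thread, or a mutex around every use of a shared handle, so that two threads never talk on the same connection at once.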
One of the more interesting "features" in ColdFusion is how it handles external requests. The basic gist of it is that when a query is made to an external source through <cfquery> or any other external request like that, it passes the external request on to a specific driver, and at that point CF itself is unable to suspend it. Even if a timeout is specified on the query or in the cfsetting, it is flatly ignored for all external requests.
http://www.coldfusionmuse.com/index.cfm/2009/6/9/killing.threads
So with that in mind, the issue we've run into is that somehow the communication between our CF server and our MySQL server sometimes goes awry and leaves behind hung threads. They have the following characteristics:
The hung thread shows up in CF and cannot be killed from FusionReactor.
There is no hung thread visible in mySQL, and no active running query (just the usual sleeps).
The database is responding to other calls and appears to be operating correctly.
Max connections have not been reached for the DB nor the user.
It seems to me the only likely candidate is that somehow CF is making a request, mySQL is responding to that request but with an answer which CF ignores and continues to keep the thread open waiting for a response from mySQL. That would explain why the database seems to show no signs of problems, but CF keeps a thread open waiting for the mysterious answer.
Usually these hung threads appear randomly on otherwise working scripts (such as posting a comment on a news article). Even while one thread is hung for that script, other requests for that script will go through, which would imply that the script isn't necessarily at fault, but rather the condition faced when the script was executed.
We ran some tests to determine whether it was a MySQL-generated max_connections error: we created a user, gave it a maximum of 1 connection, tied up that connection with a sleep(1000) query, and executed another query. Unfortunately, it correctly errored out without generating a hung thread.
So, I'm left at this point with absolutely no clue what is going wrong. Is there some other connection limit or timeout which could be causing the communication between the servers to go awry?
One of the things you should start to look at is the hardware between the two servers. It is possible that you have a router or bridge or NIC that is dropping occasional packets. This can result in the mySQL box thinking it has completed the task while the CF server sits there and waits for a complete response indefinitely, creating a hung thread.
3com has some details on testing for packet loss here: http://support.3com.com/infodeli/tools/netmgt/tncsunix/product/091500/c11ploss.htm#22128
We had a similar problem with a MS SQL server. There, the root cause was a known issue in which, for some reason, the server thinks it's shutting down, and the thread hangs (even though the server is, obviously, not shutting down).
We weren't able to eliminate the problem, but were able to reduce it by turning off pooled DB connections and fiddling with the connection refresh rate. (I think I got that label right -- no access to administrator at my new employment.) Both are in the connection properties in Administrator.
Just a note: The problem isn't entirely with CF. The problem, apparently, affects all Java apps. Which does not, in any way, reduce how annoyed I get by this.
Long story short, I believe the cause was ColdFusion 8's image processing. It was just buggy, and in CF9 I have never seen that problem again.