Server hangs all requests after a while - mysql

Our Rails 4.0 application (Ruby 2.1.2) is running on Nginx with Puma 2.9.0.
I recently noticed that all requests to our application hang after a while (usually 1 or 2 days).
When checking the log, which is set to debug mode, I noticed entries like the following stacking up:
[2014-10-11T00:02:31.727382 #23458] INFO -- : Started GET "/" for ...
This means requests actually hit the Rails app, but somehow they are never processed further, while normally the log would show:
I, [2014-10-11T00:02:31.727382 #23458] INFO -- : Started GET "/" for ....
I, [2014-10-11T00:02:31.729393 #23458] INFO -- : Processing by HomeController#index as HTML
My puma config is the following:
threads 16,32
workers 4
Our application is only for internal use for now, so the RPM is very low, and none of the requests take longer than 2s.
What are the possible reasons that could lead to this problem? (Puma config, database connections, etc.)
Thank you in advance.
Update:
After installing the rack_timer gem to log the time spent in each middleware, I realized that our requests had been getting stuck at ActiveRecord::QueryCache when the hang occurred, with a huge amount of time spent in it:
Rack Timer (incoming) -- ActiveRecord::QueryCache: 925626.7731189728 ms
I removed this middleware for now and things seem to be back to normal. However, I understand the purpose of this middleware is to improve performance, so removing it is only a temporary solution. Please help me find the possible cause of this issue.
FYI, we're using MySQL (5.1.67) with the mysql2 adapter (0.3.13).

It could be a symptom of RAM starvation due to the query cache getting too big. We saw this in one of our apps running on Heroku. The default statement_limit is 1000. Lowering the limit eased the RAM usage for us with no noticeable performance degradation:
database.yml:
default: &default
  adapter: postgresql
  pool: <%= ENV["DB_POOL"] || ENV['MAX_THREADS'] || 5 %>
  timeout: 5000
  port: 5432
  host: localhost
  statement_limit: <%= ENV["DB_STATEMENT_LIMIT"] || 200 %>
However, searching for "activerecord querycache slow" turns up other possible causes, such as outdated versions of Ruby, Puma, or rack-timeout: https://stackoverflow.com/a/44158724/126636
Or maybe a too large value for read_timeout: https://stackoverflow.com/a/30526430/126636
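If you want to rule the last one out, the mysql2 adapter accepts a read_timeout option in database.yml; a minimal sketch, assuming a standard Rails setup (the 10-second value is only an illustration, not a recommendation):
production:
  adapter: mysql2
  read_timeout: 10   # seconds; fail fast instead of hanging on a dead connection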

Related

Heroku FlyingSphinx Configuration needed for ThinkingSphinx::Configuration.instance.controller.running?

We ported our Rails 5.2 app to Heroku and were able to get almost everything working with FlyingSphinx.
Search and indexing work well but as a convenience to our users, we try to let them know when the daemon is down for service or if we're re-indexing.
Previously, we were able to use
ThinkingSphinx::Configuration.instance.controller.running?
But this always returns false on Heroku even if the daemon is running.
Our thinking_sphinx.yml doesn't specify file locations or where the pid file lives, so I suspect this may be the issue; however, I can't find anything that explains what to put in thinking_sphinx.yml for Heroku/FlyingSphinx, or whether it's necessary at all: https://freelancing-gods.com/thinking-sphinx/v3/advanced_config.html
Our thinking_sphinx.yml looks like this now:
common: &common
  mem_limit: 40M
  64bit_timestamps: true
development:
  <<: *common
test:
  <<: *common
  mysql41: 9307
  quiet_deltas: true
staging:
  <<: *common
  quiet_deltas: true
production:
  <<: *common
  version: '2.2.11'
  quiet_deltas: true
Suggestions?
Ah, I've not had this requested before, but it's definitely possible:
require "flying_sphinx/commands" if ENV["FLYING_SPHINX_IDENTIFIER"]
ThinkingSphinx::Commander.call(
:running,
ThinkingSphinx::Configuration.instance,
{}
)
When this is called locally, it'll check the daemon via the pid file, but when it's called on a Heroku app using Flying Sphinx, it talks to the Flying Sphinx API to get the running state. Hence, it's important to only run the require call for Heroku-hosted environments - there's no point having local/test envs calling the Flying Sphinx API.
Setting file/pid locations in Heroku/Flying Sphinx environments is mostly not going to do anything, because Flying Sphinx overwrites these anyway to match the standardised approach on their servers. The exceptions are stopfiles/exceptions/etc., and those corresponding files are uploaded to Flying Sphinx so the daemon there can be configured appropriately.

fatal: the remote hung up unexpectedly. Can’t push things to my git repo

My first problem looked like this:
Writing objects: 60% (9/15)
It froze there for some time with very low upload speed (in kB/s), then, after a long time, gave this message:
fatal: the remote end hung up unexpectedly
Everything up-to-date
I found something that seemed to be a solution:
git config http.postBuffer 524288000
This created a new problem that looks like this:
MacBook-Pro-Liana:LC | myWebsite Liana$ git config http.postBuffer 524288000
MacBook-Pro-Liana:LC | myWebsite Liana$ git push -u origin master
Enumerating objects: 15, done.
Counting objects: 100% (15/15), done.
Delta compression using up to 4 threads
Compressing objects: 100% (14/14), done.
Writing objects: 100% (15/15), 116.01 MiB | 25.16 MiB/s, done.
Total 15 (delta 2), reused 0 (delta 0)
error: RPC failed; curl 56 LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 54
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
Everything up-to-date
Please help, I have no idea what’s going on...
First, Git 2.25.1 made it clear that:
Users in a wide variety of situations find themselves with HTTP push problems.
Oftentimes these issues are due to antivirus software, filtering proxies, or other man-in-the-middle situations; other times, they are due to simple unreliability of the network.
This works for none of the aforementioned situations and is only useful in a small, highly restricted number of cases: essentially, when the connection does not properly support HTTP/1.1.
Raising this is not, in general, an effective solution for most push problems, but can increase memory consumption significantly since the entire buffer is allocated even for small pushes.
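In practice, that means if you already bumped http.postBuffer and it did not help, you can simply remove the setting again (add --global if that is where you set it):
git config --unset http.postBuffer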
Second, it depends on your actual remote (GitHub? GitLab? BitBucket? On-premise server). Said remote server might have an incident in progress.

DIY cartridge stops on git push

I've been developing an application for some weeks, and it's been running in an OpenShift small gear with DIY 0.1 + PostgreSQL cartridges for several days, including ~5 new deployments. Everything was OK, and a new deploy stopped and started everything in seconds.
Nevertheless, today pushing master as usual stopped the cartridge and it won't start. This is the trace:
Counting objects: 2688, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (1930/1930), done.
Writing objects: 100% (2080/2080), 10.76 MiB | 99 KiB/s, done.
Total 2080 (delta 1300), reused 13 (delta 0)
remote: Stopping DIY cartridge
fatal: The remote end hung up unexpectedly
fatal: The remote end hung up unexpectedly
Logging in with ssh and running the start action hook manually fails because the database is stopped. Restarting the gear makes everything work again.
The failing deployment has nothing to do with it, since it only adds a few lines of code, nothing about configuration or anything that might break the boot.
Logs (at $OPENSHIFT_LOG_DIR) reveal nothing. Quota usage seems fine:
Cartridges Used Limit
---------------------- ------ -----
diy-0.1 postgresql-9.2 0.6 GB 1 GB
Any suggestions about what I could check?
Oh, dumb mistake. My last working deployment involved a change in the binary name, which now matches the gear name. The stop script, with its ps | grep and so on, was wrong: it was killing not only the application but also the connection. Changing it fixed the issue.
Solution inspired by this blogpost.

Concurrent requests to ruby REST API not possible

I am trying to create a small REST API in Ruby with the Sinatra gem, running on the thin server. The point is to get an idea of how easy/hard it is to build such a REST API consisting of micro web services, and to compare this with other programming languages/technologies available on Amazon's AWS. I created one quite easily; here's the code (just a minimal working project, not yet considering any kind of optimization):
require 'sinatra'
require 'mysql'
require 'json'

set :environment, :development

db_host = 'HOST_URL'
db_user = 'USER'
db_pass = 'PASSWORD'
db_name = 'DB_NAME'
db_enc = 'utf8'
select = 'SELECT * FROM table1 LIMIT 30'

db = Mysql.init
db.options Mysql::SET_CHARSET_NAME, db_enc
db = db.real_connect db_host, db_user, db_pass, db_name

get '/brands' do
  rs = db.query select
  #db.close
  result = []
  rs.each_hash do |row|
    result.push row
  end
  result.to_json
end
Running this with ruby my_ws.rb starts Sinatra running on thin, no problem.
Using curl from my terminal like curl --get localhost:4567/brands is also not a problem returning the desired JSON response.
The real problem, which I have been tackling for a few hours now (and searching on Google of course, reading lots of resources here on SO as well), occurs when I try to benchmark the micro web service using Siege with more concurrent users:
sudo siege -b http://localhost:4567/brands -c2 -r2
This should run in benchmark mode, issuing 2 concurrent requests (the -c2 switch) 2 times (the -r2 switch). In this case I always get an error in the console stating Mysql::ProtocolError - invalid packet: sequence number mismatch(102 != 2(expected)), where the number 102 is different on each run. If I run the benchmark for only one user (one concurrent request, i.e. no concurrency at all) I can run it even 1000 times with no errors (sudo siege -b http://localhost:4567/brands -c1 -r1000).
I tried adding manual threading to my code, like this:
get '/brands' do
  th = Thread.new do
    rs = db.query select
    #db.close
    result = []
    rs.each_hash do |row|
      result.push row
    end
    result.to_json
  end
  th.join
  th.value
end
but it didn't help.
From what I have found:
Sinatra is multithreaded by default
thin is also multithreaded when started from Sinatra via ruby script.rb
multithreading seems to have no effect on DB queries - here it looks like no concurrency is possible
I'm using the ruby-mysql gem, as I found out it's newer than the (plain) mysql gem, but in the end I have no idea which one to use (I found old articles recommending mysql and some others recommending ruby-mysql instead).
Any idea how to run concurrent requests against my REST API? I need to benchmark it and compare it with other languages (PHP, Python, Scala, ...).
The problem is solved with two fixes.
The first one is replacing the mysql adapter with mysql2.
The second one was the real cause of the problem: the MySQL connection was created once at startup, before the request could even reach the thread (i.e. before the route code was executed), which (logically) caused connection locks because every request shared that single connection.
Now, with mysql2 and the DB connection moved inside the route handler, it is all working perfectly fine, even for 250 concurrent requests! The final code:
require 'sinatra'
require 'mysql2'
require 'json'

set :environment, :development

db_host = 'HOST_URL'
db_user = 'USER'
db_pass = 'PASSWORD'
db_name = 'DB_NAME'
db_enc = 'utf8'
select = 'SELECT * FROM table1 LIMIT 30'

get '/brands' do
  result = []
  Mysql2::Client.new(:host => db_host, :username => db_user, :password => db_pass, :database => db_name, :encoding => db_enc).query(select).each do |row|
    result.push row
  end
  result.to_json
end
Running sudo siege -b http://localhost:4567/brands -c250 -r4 gives me now:
Transactions: 1000 hits
Availability: 100.00 %
Elapsed time: 1.54 secs
Data transferred: 2.40 MB
Response time: 0.27 secs
Transaction rate: 649.35 trans/sec
Throughput: 1.56 MB/sec
Concurrency: 175.15
Successful transactions: 1000
Failed transactions: 0
Longest transaction: 1.23
Shortest transaction: 0.03
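One follow-up note on the design: opening a new Mysql2::Client per request is simple and thread-safe, but it pays the connection-setup cost on every call. If that ever becomes a bottleneck, a small shared pool is a common middle ground. Here is a rough sketch using the connection_pool gem, reusing the db_* variables and the select query from the script above; the gem, the pool size of 16, and the timeout are my assumptions, not part of the original solution:
require 'connection_pool'
require 'mysql2'

# One pool shared by all threads; each request checks a client out and back in.
DB_POOL = ConnectionPool.new(size: 16, timeout: 5) do
  Mysql2::Client.new(:host => db_host, :username => db_user,
                     :password => db_pass, :database => db_name,
                     :encoding => db_enc)
end

get '/brands' do
  DB_POOL.with do |client|
    client.query(select).to_a.to_json
  end
end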

Frequent worker timeout

I have set up gunicorn with 3 workers, 30 worker connections, and the eventlet worker class. It runs behind Nginx. After every few requests, I see this in the logs:
[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514
Why is this happening? How can I figure out what's going wrong?
We had the same problem using Django + nginx + gunicorn. Following the Gunicorn documentation, we configured graceful-timeout, which made almost no difference.
After some testing, we found the solution; the parameter to configure is timeout (and not graceful timeout). It works like a clock.
So, do:
1) Open the gunicorn configuration file.
2) Set the TIMEOUT to whatever you need - the value is in seconds.
NUM_WORKERS=3
TIMEOUT=120
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--timeout $TIMEOUT \
--log-level=debug \
--bind=127.0.0.1:9000 \
--pid=$PIDFILE
On Google Cloud
Just add --timeout 90 to entrypoint in app.yaml
entrypoint: gunicorn -b :$PORT main:app --timeout 90
Run Gunicorn with --log-level debug.
It should give you an app stack trace.
Is this endpoint taking too much time?
Maybe you are using Flask without asynchronous support, so every request will block the call. To add async support without much difficulty, add the gevent worker.
With gevent, each new call will be handled in a new greenlet, and your app will be able to receive more requests:
pip install gevent
gunicorn .... --worker-class gevent
The official Microsoft Azure documentation for running Flask apps on Azure App Service (Linux) states to use a timeout of 600:
gunicorn --bind=0.0.0.0 --timeout 600 application:app
https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app
WORKER TIMEOUT means your application cannot respond to the request within a defined amount of time. You can set this using the gunicorn timeout setting. Some applications need more time to respond than others.
Another thing that may affect this is the choice of worker type:
The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.
When I had the same problem as yours (I was trying to deploy my application using Docker Swarm), I tried increasing the timeout and using another worker class. But it all failed.
And then I suddenly realised I was limiting my resources too low for the service inside my compose file. This is the thing that slowed down the application in my case:
deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure
So I suggest you check what is slowing down your application in the first place.
Could it be this?
http://docs.gunicorn.org/en/latest/settings.html#timeout
Other possibilities could be your response is taking too long or is stuck waiting.
This worked for me:
gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000
If you have eventlet add:
--worker-class=eventlet
If you have gevent add:
--worker-class=gevent
I've got the same problem in Docker.
In Docker I keep a trained LightGBM model plus Flask serving requests. As the HTTP server I used gunicorn 19.9.0. When I ran my code locally on my Mac laptop everything worked perfectly, but when I ran the app in Docker my POST JSON requests would freeze for some time, and then the gunicorn worker would fail with a [CRITICAL] WORKER TIMEOUT exception.
I tried tons of different approaches, but the only one that solved my issue was adding worker_class=gthread.
Here is my complete config:
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3
I had a very similar problem. I also tried using "runserver" to see if I could find anything, but all I got was a message: Killed.
So I thought it could be a resource problem, and I went ahead and gave more RAM to the instance, and it worked.
You need to use another worker class, an asynchronous one like gevent or tornado; see this for more explanation:
First explanation:
You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing
Second one :
The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.
If you are using GCP, then you have to set workers per instance type.
Link to GCP best practices: https://cloud.google.com/appengine/docs/standard/python3/runtime
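For example (a hedged sketch, not taken verbatim from the GCP docs; main:app and the numbers are placeholders), an App Engine app.yaml entrypoint that pins both the worker count and the timeout might look like:
entrypoint: gunicorn -b :$PORT main:app --workers 2 --timeout 90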
timeout is a key parameter for this problem; however, it didn't suit my case.
I found there was no gunicorn timeout error when I set workers=1.
When I looked through my code, I found some socket connections (socket.send & socket.recv) in the server init.
socket.recv would block my code, and that's why it always timed out when workers > 1.
Hope this gives some ideas to people who have the same problem as me.
For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml, and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 in the other.
For me, it was because I forgot to set up a firewall rule on the database server for my Django app.
Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean Postgresql database. All I needed to do was add my droplet to the database's "Trusted Sources".
(Click on the database in the DO console, then click on Settings. Edit Trusted Sources and select the droplet name; click in the editable area and it will be suggested to you.)
Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
E.g. if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out after 1 second, and retry 3 times, this gives roughly 10 + 1*3 ≈ 13 seconds, and you can see that it would trigger sometimes but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.
You can see that adding more workers may help with (or hide) the problem.
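For example, if the platform doing the health checking is Kubernetes (an assumption; adapt this to whatever mechanism your platform uses), relaxing the liveness probe might look roughly like the sketch below, with /healthz, the port, and the numbers purely illustrative:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  periodSeconds: 30      # probe less often than your longest typical request
  timeoutSeconds: 5      # give the probe itself room to respond
  failureThreshold: 3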
The easiest way that worked for me is to create a new config.py file in the same folder where your app.py exists and put the timeout and any other special configuration you need inside it:
timeout = 999
Then just run the server while pointing to this configuration file
gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app
Note that for this command to work you also need a wsgi.py in the same directory, containing the following:
from myproject import app

if __name__ == "__main__":
    app.run()
Cheers!
Apart from the gunicorn timeout settings, which have already been suggested, since you have nginx in front you can check whether these 2 parameters help: proxy_connect_timeout and proxy_read_timeout, which default to 60 seconds. You can set them like this in your nginx configuration file:
proxy_connect_timeout 120s;
proxy_read_timeout 120s;
In my case I came across this issue when sending larger (10 MB) files to my server. My development server (app.run()) received them with no problem, but gunicorn could not handle them.
For people who run into the same problem I did: my solution was to send the file in chunks, like this:
ref / html example, separate large files ref
# Client side: `location`, `url`, `server_file_name`, `name` and `key` are defined elsewhere in my code.
import requests

def upload_to_server():
    upload_file_path = location

    def read_in_chunks(file_object, chunk_size=524288):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 512 KiB."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    with open(upload_file_path, 'rb') as f:
        for piece in read_in_chunks(f):
            r = requests.post(
                url + '/api/set-doc/stream' + '/' + server_file_name,
                files={name: piece},
                headers={'key': key, 'allow_all': 'true'})
my flask server:
# Server side: `app`, `key` and `allowed_file` are defined elsewhere in my code.
@app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
def api_set_file_streamed(name):
    folder = escape(name)  # secure_filename(escape(name))
    if 'key' in request.headers:
        if request.headers['key'] != key:
            return 404
    else:
        return 404
    for fn in request.files:
        file = request.files[fn]
        if fn == '':
            print('no file name')
            flash('No selected file')
            return 'fail'
        if file and allowed_file(file.filename):
            file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
            if not os.path.exists(file_dir_path):
                os.makedirs(file_dir_path)
            file_path = os.path.join(file_dir_path, secure_filename(file.filename))
            with open(file_path, 'ab') as f:
                f.write(file.read())
            return 'success'
    return 404
In case you have changed the name of the Django project, you should also go to:
cd /etc/systemd/system/
then:
sudo nano gunicorn.service
then verify that, at the end of the bind line, the application name has been changed to the new application name.
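For reference, the relevant line in such a gunicorn.service unit usually looks something like the sketch below (the paths, worker count, and socket location are assumptions from a typical Django-on-systemd setup; the part to update after a rename is the trailing myproject.wsgi:application):
ExecStart=/home/user/myproject/venv/bin/gunicorn \
          --workers 3 \
          --bind unix:/run/gunicorn.sock \
          myproject.wsgi:application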