Frequent worker timeout - gunicorn

I have set up Gunicorn with 3 workers, 30 worker connections, and the eventlet worker class. It runs behind Nginx. After every few requests, I see this in the logs:
[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514
Why is this happening? How can I figure out what's going wrong?

We had the same problem using Django + nginx + Gunicorn. Following the Gunicorn documentation, we first configured graceful-timeout, which made almost no difference.
After some testing we found the solution: the parameter to configure is timeout (not graceful-timeout). It works like clockwork.
So, do the following:
1) Open the Gunicorn configuration file
2) Set TIMEOUT to whatever you need; the value is in seconds
NUM_WORKERS=3
TIMEOUT=120
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
  --name $NAME \
  --workers $NUM_WORKERS \
  --timeout $TIMEOUT \
  --log-level=debug \
  --bind=127.0.0.1:9000 \
  --pid=$PIDFILE

On Google Cloud, just add --timeout 90 to the entrypoint in app.yaml:
entrypoint: gunicorn -b :$PORT main:app --timeout 90

Run Gunicorn with --log-level debug.
It should give you an app stack trace.
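If the debug log alone does not show where the worker is stuck, one option is Gunicorn's worker_abort server hook, which the Gunicorn docs describe as being called when a worker receives SIGABRT, generally on timeout. Below is a minimal sketch of a config file that logs every thread's Python stack from that hook. The file name and the helper format_all_stacks are illustrative assumptions, and the hook may behave differently depending on the worker class.

```python
# gunicorn.conf.py (illustrative name); load with: gunicorn -c gunicorn.conf.py ...
import sys
import traceback

loglevel = "debug"


def format_all_stacks():
    """Return the current Python stack of every thread as one string."""
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append(f"Thread {thread_id}:\n")
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


def worker_abort(worker):
    """Gunicorn server hook: log where the worker was when it was aborted."""
    worker.log.error("worker %s aborted, stacks:\n%s", worker.pid, format_all_stacks())
```

The logged stack should point at the call that was running when the timeout hit, for example a slow database query or an outgoing HTTP request.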

Is this endpoint taking too much time?
Maybe you are using Flask without asynchronous support, so every request blocks its worker until it finishes. An easy way to add async support is the gevent worker.
With gevent, each new request is handled in its own greenlet (a lightweight cooperative thread), so your app will be able to accept more requests concurrently:
pip install gevent
gunicorn .... --worker-class gevent

The official Microsoft Azure documentation for running Flask apps on Azure App Service (Linux) uses a timeout of 600:
gunicorn --bind=0.0.0.0 --timeout 600 application:app
https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app

WORKER TIMEOUT means your application cannot respond to the request within a defined amount of time. You can set this limit using the gunicorn timeout setting. Some applications need more time to respond than others.
Another thing that may affect this is the choice of worker type:
The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.
When I had the same problem as yours (I was trying to deploy my application using Docker Swarm), I tried increasing the timeout and using another worker class, but both failed.
Then I suddenly realised that the resource limits I had set for the service in my compose file were too low. That is what slowed down the application in my case:
deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure
So I suggest you first check what is slowing down your application.

Could it be this?
http://docs.gunicorn.org/en/latest/settings.html#timeout
Other possibilities: your response is taking too long, or it is stuck waiting on something.

This worked for me:
gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000
If you have eventlet add:
--worker-class=eventlet
If you have gevent add:
--worker-class=gevent

I've got the same problem in Docker.
In Docker I keep a trained LightGBM model plus Flask serving requests. As the HTTP server I used gunicorn 19.9.0. When I ran my code locally on my Mac laptop everything worked perfectly, but when I ran the app in Docker my POST JSON requests froze for some time, and then the gunicorn worker failed with a [CRITICAL] WORKER TIMEOUT error.
I tried tons of different approaches, but the only one that solved my issue was adding worker_class=gthread.
Here is my complete config:
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3

I had a very similar problem. I also tried using "runserver" to see if I could find anything, but all I got was the message Killed.
So I thought it could be a resource problem. I gave the instance more RAM, and it worked.

You need to use another worker class, an async one such as gevent or tornado. See these explanations:
First explanation:
You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing
Second one:
The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.

If you are using GCP, you have to set the number of workers according to the instance type.
Link to GCP best practices https://cloud.google.com/appengine/docs/standard/python3/runtime

timeout is a key parameter for this problem.
However, changing it did not work for me.
I found that there was no gunicorn timeout error when I set workers=1.
When I looked through my code, I found some socket calls (socket.send & socket.recv) in the server init.
socket.recv was blocking my code, and that's why it always timed out when workers>1.
I hope this gives some ideas to people who have the same problem as me.
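A blocking recv during start-up can be bounded with a socket timeout, so that initialization gives up quickly instead of hanging until Gunicorn kills the worker. This is a minimal sketch, assuming the start-up code can tolerate a missing reply. recv_or_none is a hypothetical helper, not part of the code described above, and the demonstration uses a local socket pair rather than a real peer.

```python
import socket


def recv_or_none(sock, nbytes, timeout_s):
    """Read up to nbytes, or return None if nothing arrives within timeout_s."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None


# Demonstration with a connected local socket pair (no real network involved).
left, right = socket.socketpair()
first = recv_or_none(left, 1024, 0.2)   # peer has sent nothing yet
right.sendall(b"ready")
second = recv_or_none(left, 1024, 0.2)  # data is waiting now
left.close()
right.close()
```

Here first is None because the call timed out, while second holds the bytes that were sent.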

For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml, and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 in the other.

For me, it was because I forgot to set up a firewall rule on the database server for my Django app.

Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean Postgresql database. All I needed to do was add my droplet to the database's "Trusted Sources".
(click on database in DO console, then click on settings. Edit Trusted Sources and select droplet name (click in editable area and it will be suggested to you)).

Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
E.g. if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out after 1 second, and retry 3 times, that gives roughly 10 + 1*3 = 13 seconds before the worker is considered dead. A 25-second request can therefore trigger the kill, though it happens sometimes rather than always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.
You can see that adding more workers may help with (or hide) the problem.
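The estimate above can be written out as a small calculation. This is only a sketch of the rough formula used in this answer (one check period plus one probe timeout per retry); real platforms apply their own rules, and the function names are illustrative.

```python
def kill_window_s(period_s, probe_timeout_s, retries):
    """Rough time before the platform gives up on a worker, using the
    estimate from the text: one period plus one probe timeout per retry."""
    return period_s + probe_timeout_s * retries


def may_be_killed(request_s, period_s, probe_timeout_s, retries):
    """True if a blocking request outlasts the estimated kill window."""
    return request_s > kill_window_s(period_s, probe_timeout_s, retries)


window = kill_window_s(period_s=10, probe_timeout_s=1, retries=3)
at_risk = may_be_killed(request_s=25, period_s=10, probe_timeout_s=1, retries=3)
print(window, at_risk)  # 13 True
```

With these numbers, any request that blocks the health check for more than about 13 seconds is a candidate for being killed.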

The easiest way that worked for me is to create a new config.py file in the same folder where your app.py exists and to put inside it the timeout and all your desired special configuration:
timeout = 999
Then just run the server while pointing to this configuration file
gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app
Note that for this command to work you also need a wsgi.py in the same directory, containing the following:
from myproject import app

if __name__ == "__main__":
    app.run()
Cheers!

Apart from the gunicorn timeout settings already suggested, since you are using nginx in front, you can check whether these two parameters help: proxy_connect_timeout and proxy_read_timeout, which both default to 60 seconds. You can set them in your nginx configuration file like this:
proxy_connect_timeout 120s;
proxy_read_timeout 120s;

In my case I came across this issue when sending larger (10 MB) files to my server. My development server (app.run()) received them without a problem, but gunicorn could not handle them.
For people who run into the same problem I did: my solution was to send the file in chunks, like this:
ref / html example, separate large files ref
import requests

# location, url, server_file_name, name and key are defined elsewhere in my client
def upload_to_server():
    upload_file_path = location

    def read_in_chunks(file_object, chunk_size=524288):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 512 KiB."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    with open(upload_file_path, 'rb') as f:
        for piece in read_in_chunks(f):
            # each chunk is posted separately; the server appends it to the file
            r = requests.post(
                url + '/api/set-doc/stream' + '/' + server_file_name,
                files={name: piece},
                headers={'key': key, 'allow_all': 'true'})
My Flask server:
@app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
def api_set_file_streamed(name):
    folder = escape(name)  # secure_filename(escape(name))
    # reject requests without the expected key header
    # (a Flask view cannot return a bare int, so return a body plus status)
    if 'key' in request.headers:
        if request.headers['key'] != key:
            return '', 404
    else:
        return '', 404
    for fn in request.files:
        file = request.files[fn]
        if fn == '':
            print('no file name')
            flash('No selected file')
            return 'fail'
        if file and allowed_file(file.filename):
            file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
            if not os.path.exists(file_dir_path):
                os.makedirs(file_dir_path)
            file_path = os.path.join(file_dir_path, secure_filename(file.filename))
            # append, so successive chunks rebuild the original file
            with open(file_path, 'ab') as f:
                f.write(file.read())
            return 'success'
    return '', 404

In case you have changed the name of the Django project, you should also go to
cd /etc/systemd/system/
then
sudo nano gunicorn.service
then verify that at the end of the bind line the application name has been changed to the new application name.

Related

PouchDB Replication throws error upon replication

When I try to replicate a remote CouchDB (on Ubuntu 14.04, 64-bit) to my local PouchDB, I encounter this strange error.
My CouchDB is proxied via nginx and served over HTTPS. Traffic from the client to nginx is SSL, while nginx to CouchDB is plain HTTP. CORS requests are enabled in CouchDB. The nginx configuration is very close to the one CouchDB recommends. Sync from the database is working fine, but I get the errors below when debugging in Chrome version 54.0.2840.100 (64-bit).
The following is the full stack trace of the error.
raven.min.js:2 Error: There was a problem getting docs.
at finishBatch (http://localhost:8100/lib/pouchdb/dist/pouchdb.js:6410:13)
at processQueue (http://localhost:8100/lib/ionic/js/ionic.bundle.js:27879:28)
at http://localhost:8100/lib/ionic/js/ionic.bundle.js:27895:27
at Scope.$eval (http://localhost:8100/lib/ionic/js/ionic.bundle.js:29158:28)
at Scope.$digest (http://localhost:8100/lib/ionic/js/ionic.bundle.js:28969:31)
at http://localhost:8100/lib/ionic/js/ionic.bundle.js:29197:26
at completeOutstandingRequest (http://localhost:8100/lib/ionic/js/ionic.bundle.js:18706:10)
at http://localhost:8100/lib/ionic/js/ionic.bundle.js:18978:7
at d (http://localhost:8100/lib/raven-js/dist/raven.min.js:2:4308) undefined
a.(anonymous function) @ raven.min.js:2
(anonymous function) @ ionic.bundle.js:25642
(anonymous function) @ ionic.bundle.js:22421
(anonymous function) @ angular.min.js:2
processQueue @ ionic.bundle.js:27887
(anonymous function) @ ionic.bundle.js:27895
$eval @ ionic.bundle.js:29158
$digest @ ionic.bundle.js:28969
(anonymous function) @ ionic.bundle.js:29197
completeOutstandingRequest @ ionic.bundle.js:18706
(anonymous function) @ ionic.bundle.js:18978
d @ raven.min.js:2
raven.min.js:2 Paused in lessondb replicate Error: There was a problem getting docs.
at finishBatch (http://localhost:8100/lib/pouchdb/dist/pouchdb.js:6410:13)
at processQueue (http://localhost:8100/lib/ionic/js/ionic.bundle.js:27879:28)
at http://localhost:8100/lib/ionic/js/ionic.bundle.js:27895:27
at Scope.$eval (http://localhost:8100/lib/ionic/js/ionic.bundle.js:29158:28)
at Scope.$digest (http://localhost:8100/lib/ionic/js/ionic.bundle.js:28969:31)
at http://localhost:8100/lib/ionic/js/ionic.bundle.js:29197:26
at completeOutstandingRequest (http://localhost:8100/lib/ionic/js/ionic.bundle.js:18706:10)
at http://localhost:8100/lib/ionic/js/ionic.bundle.js:18978:7
at d (http://localhost:8100/lib/raven-js/dist/raven.min.js:2:4308)
The network logs in Chrome show that some requests are cancelled.
I am using CouchDB version 1.6.1 and PouchDB version 5.3.2.
I use the following command to replicate the databases:
myDB.replicate.from(remote_db_url,{
live: true,
retry: true,
heartbeat: false
})
Also, it would be great if someone could shed some light on the heartbeat parameter.
Note: I'm not able to solve the error you describe. Maybe a full stack trace rather than a screenshot might help...
But I will try to shed some light on the heartbeat parameter: Reading the docs already helps a little. See the advanced options for the replicate method:
options.heartbeat: Configure the heartbeat supported by CouchDB which keeps the change connection alive.
So let's look into CouchDB's docs to see what this parameter does:
Networks are a tricky beast, and sometimes you don’t know whether there are no changes coming or your network connection went stale. If you add another query parameter, heartbeat=N, where N is a number, CouchDB will send you a newline character each N milliseconds. As long as you are receiving newline characters, you know there are no new change notifications, but CouchDB is still ready to send you the next one when it occurs.
So basically it seems to be a keep-alive mechanism that sends a message (e.g. a newline) every n milliseconds (where n is the heartbeat value you specify) to make sure the connection between the two databases is still working.
Setting the value to false will disable this mechanism.
Regarding which values can be used for this parameter:
The PouchDB docs further state that the changes method has a similar parameter, described like this:
options.heartbeat: For http adapter only, time in milliseconds for server to give a heartbeat to keep long connections open. Defaults to 10000 (10 seconds), use false to disable the default.
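To make the mechanism concrete, here is a minimal Python sketch of a client for a continuous _changes feed that treats empty lines as heartbeats. The database URL and the sample lines are made up for illustration (they are not output from a real server), and changes_url and read_changes are hypothetical helper names.

```python
import json
from urllib.parse import urlencode


def changes_url(db_url, heartbeat_ms):
    """Build a continuous _changes URL that asks CouchDB for a heartbeat."""
    query = urlencode({"feed": "continuous", "heartbeat": heartbeat_ms})
    return f"{db_url.rstrip('/')}/_changes?{query}"


def read_changes(lines):
    """Yield parsed change rows, skipping the empty heartbeat lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # a bare newline is only a keep-alive
        yield json.loads(line)


# Assumed local database URL and hand-written sample feed lines.
url = changes_url("http://localhost:5984/lessondb", 10000)
sample = ['{"seq": 1, "id": "doc1", "changes": [{"rev": "1-abc"}]}', "", "\n"]
rows = list(read_changes(sample))
```

Only the first sample line is a change; the two blank lines stand in for heartbeats and are ignored.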

Why does my openshift app timeout when I try to access the URL?

I am trying to set up a BrowserQuest server that runs in openshift
I've been following this readme. Everything seems to go fine, I get to the end and run rhc app show bq and get the following output:
bq @ http://bq-plantagenet.rhcloud.com/ (uuid: 55e4311189f5cf028d0000fc)
------------------------------------------------------------------------
Domain: plantagenet
Created: 8:18 AM
Gears: 1 (defaults to small)
Git URL: ssh://55e4311189f5cf028d0000fc@bq-plantagenet.rhcloud.com/~/git/bq.git/
SSH: 55e4311189f5cf028d0000fc@bq-plantagenet.rhcloud.com
Deployment: auto (on git push)
nodejs-0.10 (Node.js 0.10)
--------------------------
Gears: Located with smarterclayton-redis-2.6
smarterclayton-redis-2.6 (Redis)
--------------------------------
From: http://cartreflect-claytondev.rhcloud.com/reflect?github=smarterclayton/openshift-redis-cart
Website: https://github.com/smarterclayton/openshift-redis-cart
Gears: Located with nodejs-0.10
But when I try to access http://bq-plantagenet.rhcloud.com:8080/ in a browser, I get:
The connection has timed out
The server at bq-plantagenet.rhcloud.com is taking too long to respond
My questions are what is going wrong and how can I fix it? Many thanks for your consideration in reading through this and any suggestions you might have for resolving it
You need to access http://bq-plantagenet.rhcloud.com and leave off port 8080; that is the port your app listens on internally. You should also check your log files (https://developers.openshift.com/en/managing-log-files.html) to see what errors your application is producing.

Hot reconfiguration of HAProxy still leads to failed requests, any suggestions?

I found that there are still failed requests under high traffic when I use a command like this
haproxy -f /etc/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
to hot-reload the updated config file.
Below is the pressure-testing result using webbench:
/usr/local/bin/webbench -c 10 -t 30 targetHProxyIP:1080
Webbench – Simple Web Benchmark 1.5
Copyright (c) Radim Kolar 1997-2004, GPL Open Source Software.
Benchmarking: GET targetHProxyIP:1080
10 clients, running 30 sec.
Speed=70586 pages/min, 13372974 bytes/sec.
**Requests: 35289 susceed, 4 failed.**
I ran the command
haproxy -f /etc/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
several times during the pressure test.
The haproxy documentation mentions:
They will receive the SIGTTOU signal to ask them to temporarily stop listening to the ports so that the new process can grab them
So there is a period during which the old process is no longer listening on the port (say 80) and the new process has not yet started listening on it, and NEW connections arriving in that window will fail. Does that make sense?
Is there any way to reload the haproxy configuration without affecting either existing connections or new connections?
On recent kernels (3.9+), where SO_REUSEPORT is finally implemented, this dead period no longer exists. A patch has been available for older kernels for something like 10 years, but obviously many users cannot patch their kernels. If your system is more recent, the new process will succeed in its attempt to bind() before asking the previous one to release the port, so there is a period where both processes are bound to the port instead of a period where no process is.
There is still a very tiny possibility that a connection arrived in the leaving process' queue at the moment it closes it. There is no reliable way to stop this from happening though.
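The overlap described above can be observed with two plain sockets. This is a minimal sketch, assuming Linux 3.9+ (or another platform that exposes socket.SO_REUSEPORT). It uses two sockets in a single Python process rather than two HAProxy processes, and bind_reuseport is a hypothetical helper.

```python
import socket


def bind_reuseport(port, host="127.0.0.1"):
    """Open a listening TCP socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(16)
    return sock


old = bind_reuseport(0)          # port 0: let the kernel pick a free port
port = old.getsockname()[1]
new = bind_reuseport(port)       # succeeds while `old` is still listening
both_bound = old.getsockname()[1] == new.getsockname()[1]
old.close()                      # the "old process" leaves; `new` keeps the port
new.close()
```

Without SO_REUSEPORT on both sockets, the second bind() would normally fail with "Address already in use", which corresponds to the gap the question describes.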

How to automatically exit/stop the running instance

I have managed to create an instance and ssh into it. However, I have couple of questions regarding the Google Compute Engine.
I understand that I will be charged for the time my instance is running, that is, until I exit the instance. Is my understanding correct?
I wish to run a batch job (a Java program) on my instance. How do I make my instance stop automatically after the job is complete, so that I don't get charged for the additional time it may run?
If I start the job and disconnect my PC, will the job continue to run on the instance?
Regards,
Asim
Correct, instances are charged for the time they are running. (to the minute, minimum 10 minutes). Instances run from the time they are started via the API until they are stopped via the API. It doesn't matter if any user is logged in via SSH or not. For most automated use cases users never log in - programs are installed and started via start up scripts.
You can view your running instances via the Cloud Console, to confirm if any are currently running.
If you want to stop your instance from inside the instance, the easiest way is to start the instance with the compute-rw Service Account Scope and use gcutil.
For example, to start your instance from the command line with the compute-rw scope:
$ gcutil --project=<project-id> addinstance <instance name> --service_account_scopes=compute-rw
(this is the default when manually creating an instance via the Cloud Console)
Later, after your batch job completes, you can remove the instance from inside the instance:
$ gcutil deleteinstance -f <instance name>
You can put the halt command at the end of your batch script (assuming that you write your results to a persistent disk).
After halt, the instance will have a state of TERMINATED and you will not be charged.
See https://developers.google.com/compute/docs/pricing
and scroll down to "instance uptime".
You can automatically shut down the instance after model training. Just run a few extra lines of code once the training is complete.
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
service = discovery.build('compute', 'v1', credentials=credentials)
# Project ID for this request.
project = 'xyz' # Project ID
# The name of the zone for this request.
zone = 'xyz' # Zone information
# Name of the instance resource to stop.
instance = 'xyz' # instance id
request = service.instances().stop(project=project, zone=zone, instance=instance)
response = request.execute()
Add this to your model training script. When the training is complete, the GCP instance shuts down automatically.
More info on official website:
https://cloud.google.com/compute/docs/reference/rest/v1/instances/stop
If you want to stop the instance from a Python script, you can do it this way:
from google.cloud.compute_v1.services.instances import InstancesClient

# from_service_account_file is a classmethod, so call it on the class itself
instance_client = InstancesClient.from_service_account_file("<location-path>")
zone = "<zone>"
project = "<project>"
instance = "<instance_id>"
instance_client.stop(project=project, instance=instance, zone=zone)
In the above script, I have assumed you are using a service account for authentication. Documentation for the libraries used is here:
https://googleapis.dev/python/compute/latest/compute_v1/instances.html

redis.conf include: "Bad directive or wrong number of arguments"

I've created this config for redis [/etc/redis/map.conf]:
include /etc/redis/ideal.conf
port 11235
pidfile /var/run/redis-map.pid
logfile /var/log/redis/map.log
dbfilename map.rdb
As you can see, it includes /etc/redis/ideal.conf; this file actually exists and we have read permissions.
Also there is another file, slightly different; consider [/etc/redis/storage.conf]:
include /etc/redis/ideal.conf
pidfile /var/run/redis-storage.pid
port 8000
bind 192.168.0.3
logfile /var/log/redis/storage.log
dbfilename dump_storage.rdb
My problem is: I can launch redis-server with storage.conf (and everything works fine), but map.conf leads to the following error:
Reading the configuration file, at line 1
>>> 'include /etc/redis/ideal.conf'
Bad directive or wrong number of arguments
failed
Version of redis is 2.2.
Where did I go wrong?
Sorry guys.
I was using two different Redis binaries.
The instance for storage.conf was launched by /usr/local/bin/redis-server, but the one for map.conf was launched by /usr/bin/redis-server; the second one is broken.
Thank you anyway.