supervisor (with gunicorn) stops logging after http error 500


I am using supervisor (3.2.0-2ubuntu0.1) to manage gunicorn with this very common configuration:
[program:app]
command = sudo gunicorn -w 1 -b 0.0.0.0:8000 application:app --error-logfile /var/log/gunicorn/error.log --access-logfile /var/log/gunicorn/access.log
directory = /home/ubuntu/app
user = ubuntu
Supervisor correctly captures the logs from gunicorn, and gunicorn correctly writes its own logs.
However, as soon as there is a 500 in the underlying API served by gunicorn, supervisor stops capturing the logs (while gunicorn still records the issue correctly in its error.log).
How do I fix this?

Turns out the issue was with the worker in python itself. If you try to log something that the logger cannot interpret, the logger becomes foobared and any further attempt to log is doomed.
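A defensive pattern that may help here, sketched under the assumption that the failure comes from a value whose str() or repr() raises: convert values to text yourself before handing them to the logger. The helper names safe_text and log_error and the logger name "app" are hypothetical, not part of the logging module or of the original application.

```python
import logging

logger = logging.getLogger("app")  # hypothetical logger name


def safe_text(value):
    """Return a printable string for value, even if str() or repr() raise."""
    try:
        return str(value)
    except Exception:
        try:
            return repr(value)
        except Exception:
            # Last resort: describe the object without calling its own methods.
            return "<unprintable %s object>" % type(value).__name__


def log_error(message, value):
    """Log a message together with a value that may not be safely printable."""
    logger.error("%s: %s", message, safe_text(value))
```

With this, a call such as log_error("request failed", payload) always passes the logger a plain string, whatever payload is.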


Unable to connect worker node to master using K3S

I am trying to set up a K3S cluster for learning purposes, but I am having trouble connecting the master node with agents. I have looked at several tutorials and discussions on this, but I can't find a solution. I know I am probably missing something obvious (due to my lack of knowledge), but help would still be much appreciated.
I am using two AWS t2.micro instances with default configuration.
I SSH'd into the master and installed K3S using
curl -sfL https://get.k3s.io | sh -s - --no-deploy traefik --write-kubeconfig-mode 644 --node-name k3s-master-01
With kubectl get nodes, I am able to see the master
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 13s v1.23.6+k3s1
So far it seems I am doing things right. From what I understand, I am supposed to configure the kubeconfig file. So, I accessed it by using
cat /etc/rancher/k3s/k3s.yaml
I copied the configuration file and changed the server entry to match the private IP I took from the AWS console, resulting in something like this
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <lots_of_info>
    server: https://<master_private_IP>:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    client-certificate-data: <my_certificate_data>
    client-key-data: <my_key_data>
Then I ran vi ~/.kube/config and pasted the kubeconfig file there.
Finally, I grabbed the token with cat /var/lib/rancher/k3s/server/node-token, SSH'd into the other machine, and ran the following
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-01 K3S_URL=https://<master_private_IP>:6443 K3S_TOKEN=<master_token> sh -
The output is
[INFO] Finding release for channel stable
[INFO] Using v1.23.6+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/sha256sum-amd64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/k3s
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s-agent.service
[INFO] systemd: Enabling k3s-agent unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s-agent.service → /etc/systemd/system/k3s-agent.service.
[INFO] systemd: Starting k3s-agent
By this output, it looks like I have created an agent. However, when I run kubectl get nodes in the master, I still get
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 12m v1.23.6+k3s1
What was I supposed to do to get the agent connected to the master? I guess I am missing something simple, but I just can't find the solution. I've read all the documentation, but it is still not clear to me where I am making the mistake. I've also tried saving the master's private IP and token on the agent as environment variables with export K3S_TOKEN=master_token and K3S_URL=master_private_IP, and then simply running curl -sfL https://get.k3s.io | sh -, but I still can't see the worker node when running kubectl get nodes.
Any help would be appreciated.

It might be your VM instance firewall that prevents a proper connection from your master to the worker node (and vice versa). The official Rancher documentation advises disabling the firewall on (Red Hat/CentOS) Enterprise Linux:
It is recommended to turn off firewalld:
systemctl disable firewalld --now
If enabled, it is required to disable nm-cloud-setup and reboot the node:
systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
reboot
If you are using Ubuntu on your VMs, there is a different firewall tool (ufw).
In my case, allowing TCP connections on ports 6443 and 443 (not sure if the latter is required) worked fine.
Allow TCP connections on port 6443 on all of your cluster machines:
sudo ufw allow 6443/tcp
Then apply k3s installation script in your worker node(s):
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-1 K3S_URL=https://<k3s-master-1 IP>:6443 K3S_TOKEN=<k3s-master-1 TOKEN> sh -
This should work. If not, you can try adding additional allow rule for 443 tcp port as well.
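Before re-running the installer, it can help to confirm from the worker that the master's port 6443 is reachable at all. The following is a minimal sketch using only the Python standard library; can_connect is a hypothetical helper name, and the three-second timeout is an arbitrary choice.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

On the worker, calling can_connect("<master_private_IP>", 6443) with the real address should return True once the firewall rule is in place. A False result suggests the traffic is still being blocked somewhere between the two machines.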

A few options to check.
Check Journalctl for errors
journalctl -u k3s-agent.service -n 300 -xn
If using RaspberryPi for a worker node, make sure you have
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
at the very end of your /boot/cmdline.txt file. DO NOT PUT THIS VALUE ON A NEW LINE! It should just be appended to the end of the existing line.
If your master node(s) have self-signed certs, make sure you copy the master node's self signed cert to your worker node(s). In linux or raspberry pi copy cert to /usr/local/share/ca-certificates, then issue an
sudo update-ca-certificates
on the worker node
Don't forget to reboot the worker node after you make these changes!
Hope this helps someone!

"gclient sync" fails due to SSL3 certificate verify failed

I have been trying to fetch chromium source code. However, I got stuck on gclient sync for 2 days.
gclient sync fails every time due to error related to SSL certificate verification failure.
The log is below:
rna@rna-P580:~/workspace/project$ gclient sync
Syncing projects: 98% (83/84), done.
________ running 'download_from_google_storage --no_resume --platform=linux* --no_auth --bucket chromium-gn -s src/buildtools/linux32/gn.sha1' in '/home/rna/workspace/project'
/home/rna/workspace/project/depot_tools/third_party/boto/pyami/config.py:75: UserWarning: Unable to load AWS_CREDENTIAL_FILE ()
warnings.warn('Unable to load AWS_CREDENTIAL_FILE (%s)' % full_path)
Failure: [Errno 1] _ssl.c:509: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed.
Error: Command download_from_google_storage --no_resume --platform=linux* --no_auth --bucket chromium-gn -s src/buildtools/linux32/gn.sha1 returned non-zero exit status 1 in /home/rna/workspace/project
I am guessing this happens because I am behind a company firewall.
So I requested that http & https be opened, but still no luck.
Can someone help me out, please? I'm on ubuntu 13.10

I ran into this problem as well, what fixed it for me was doing: sudo apt-get update and sudo apt-get upgrade.

I modified the DEPS file under the /trunk directory and commented out some code:
#{
# # Download test resources, i.e. video and audio files from Google Storage.
# "pattern": "\\.sha1",
# "action": ["download_from_google_storage",
# "--directory",
# "--recursive",
# "--num_threads=10",
# "--no_auth",
# "--bucket", "chromium-webrtc-resources",
# Var("root_dir") + "/resources"],
# },
,
Then I re-ran gclient runhooks and got a correct result.
Source:
https://code.google.com/p/webrtc/issues/detail?id=3314

Smtp error 451 Temporary local - please try later on Cpanel Server

I have a Cpanel Server.
It sends emails correctly except from one domain hosted on the server. When I try to send email from that domain using Roundcube or Horde, I get the error:
SMTP Error (451): Failed to add recipient "recipient@exmple.com" (Temporary local problem - please try later).
Does anyone know why and how to fix this?
I found the problem:
After reviewing the file /var/log/exim_mainlog using
tail -f /var/log/exim_mainlog
I noticed that the error was:
2013-05-29 20:04:28 SMTP connection from [127.0.0.1]:36797 (TCP/IP connection count = 1)
2013-05-29 20:04:28 lowest numbered MX record points to local host: domain.com (while verifying <user@domain.com> from host localhost.localdomain (domain.com) [127.0.0.1]:36797)
2013-05-29 20:04:28 H=localhost.localdomain (domain.com) [127.0.0.1]:36797 sender verify defer for <user@domain.com>: lowest numbered MX record points to local host
2013-05-29 20:04:28 H=localhost.localdomain (domain.com) [127.0.0.1]:36797 F=<user@domain.com> A=dovecot_login:narena temporarily rejected RCPT <recipient@exmple.com>: Could not complete sender verify
2013-05-29 20:04:28 SMTP connection from localhost.localdomain (domain.com) [127.0.0.1]:36797 closed by QUIT
So the main problem was:
lowest numbered MX record points to local host
After a bit of searching I found the solution at http://forums.cpanel.net/f5/lowest-numbered-mx-record-points-local-host-73563.html
which was to:
Log in to WHM and go to Main >> DNS Functions >> Edit MX Entry for the domain.
Set the MX priority to 0 for the related domain and save.
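When the log is long, the deferral reason can be pulled out programmatically instead of read by eye. This is a small sketch that assumes the line format shown in the excerpt above; find_sender_verify_defers is a hypothetical helper, and other Exim versions or configurations may format these lines differently.

```python
import re

# Matches the "sender verify defer" lines in the format of the log excerpt.
DEFER_RE = re.compile(r"sender verify defer for <([^>]+)>: (.+)$")


def find_sender_verify_defers(lines):
    """Return (address, reason) pairs for each sender-verify deferral."""
    results = []
    for line in lines:
        match = DEFER_RE.search(line.rstrip("\n"))
        if match:
            results.append((match.group(1), match.group(2)))
    return results
```

It can be fed an open file, for example the result of open("/var/log/exim_mainlog"), since a file object yields one line per iteration.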

I had the same problem after running a script to fix directory permissions on a cPanel-powered server (CentOS 6.5). I checked the logfile (tail -f /var/log/exim_mainlog) and found this error:
require_files: error for /home/user_name/etc/domain.com: Permission denied
Just ran the following command and the issue was fixed:
chown -R user_name:mail /home/user_name/etc/
Hope this helps someone.

Check the file /var/log/exim_mainlog to see more information about the error:
tail -f /var/log/exim_mainlog
while trying to send an email.

Check your MX Entry in cPanel; if the existing domain priority is less than or equal to 0, set it to 1. Mine is fixed. Hope it helps you.

Wow, after about an hour of searching and meddling with different files, I'd caution any novice not to venture into editing anything before you have a backup or image of your server, as you can cause irrevocable damage to it. So many people talking garbage about what you should do or test without any real solution.
Anyways, here's what worked for me:
Real problem: Exim was updated to latest version which has loads of bugs like this issue.
How I fixed my server:
Authenticate to Linux via SSH and run the command lines through which we download and install the old version of EXIM.
Command Line 1: wget https://ca1.dynanode.net/exim-4.93-3.el7.x86_64.rpm
Command Line 2: rpm -Uvh --oldpackage exim-4.93-3.el7.x86_64.rpm
Command Line 3: systemctl restart exim
Command Line 4: systemctl restart clamd
Command Line 5: systemctl restart spamassassin
Optional: just type "Reboot" to restart your server
The command lines above do the following:
Downloads the old package (I'm sure you can google other sources with this file)
Install the old package without prompt
Restart the Exim service
Restart the Clamd Service (AV)
Restart the spamassassin service (Spam Filter)
Restart Outlook or whatever you use as a mail client and send an email. Mine works, hope yours does too.

Frequent worker timeout

I have setup gunicorn with 3 workers, 30 worker connections and using eventlet worker class. It is set up behind Nginx. After every few requests, I see this in the logs.
[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514
Why is this happening? How can I figure out what's going wrong?

We had the same problem using Django+nginx+gunicorn. Following the Gunicorn documentation, we configured graceful-timeout, which made almost no difference.
After some testing, we found the solution: the parameter to configure is timeout (not graceful-timeout). It works like clockwork.
So, do the following:
1) Open the gunicorn configuration file.
2) Set TIMEOUT to whatever you need; the value is in seconds.
NUM_WORKERS=3
TIMEOUT=120
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--timeout $TIMEOUT \
--log-level=debug \
--bind=127.0.0.1:9000 \
--pid=$PIDFILE

On Google Cloud
Just add --timeout 90 to entrypoint in app.yaml
entrypoint: gunicorn -b :$PORT main:app --timeout 90

Run Gunicorn with --log-level debug.
It should give you an app stack trace.

Is this endpoint taking too much time?
Maybe you are using Flask without asynchronous support, so every request blocks the worker. An easy way to add async support is to use the gevent worker.
With gevent, each new request is handled in its own greenlet (a lightweight cooperative thread), so your app will be able to receive more requests.
pip install gevent
gunicorn .... --worker-class gevent

The Microsoft Azure official documentation for running Flask Apps on Azure App Services (Linux App) states the use of timeout as 600
gunicorn --bind=0.0.0.0 --timeout 600 application:app
https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app

WORKER TIMEOUT means your application cannot respond to the request within a defined amount of time. You can set this using the gunicorn timeout setting. Some applications need more time to respond than others.
Another thing that may affect this is choosing the worker type
The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.
When I got the same problem as yours (I was trying to deploy my application using Docker Swarm), I tried increasing the timeout and using another worker class. Both failed.
Then I suddenly realised I had set the resource limits too low for the service inside my compose file. This is what slowed down the application in my case:
deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure
So I suggest you first check what is slowing down your application.

Could it be this?
http://docs.gunicorn.org/en/latest/settings.html#timeout
Other possibilities could be your response is taking too long or is stuck waiting.

This worked for me:
gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000
If you have eventlet add:
--worker-class=eventlet
If you have gevent add:
--worker-class=gevent

I've got the same problem in Docker.
In Docker I keep a trained LightGBM model + Flask serving requests. As the HTTP server I used gunicorn 19.9.0. When I ran my code locally on my Mac laptop everything worked just perfectly, but when I ran the app in Docker my POST JSON requests froze for some time, and then the gunicorn worker failed with a [CRITICAL] WORKER TIMEOUT exception.
I tried tons of different approaches, but the only one that solved my issue was adding worker_class=gthread.
Here is my complete config:
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3

I had very similar problem, I also tried using "runserver" to see if I could find anything but all I had was a message Killed
So I thought it could be resource problem, and I went ahead to give more RAM to the instance, and it worked.

You need to use another worker class, an async one like gevent or tornado. See these for more explanation:
First explanation:
You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing
Second one :
The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.

If you are using GCP then you have to set workers per instance type.
Link to GCP best practices https://cloud.google.com/appengine/docs/standard/python3/runtime

timeout is a key parameter for this problem.
However, it did not help in my case.
I found there was no gunicorn timeout error when I set workers=1.
When I looked through my code, I found some socket calls (socket.send and socket.recv) in the server init.
socket.recv was blocking my code, and that's why it always timed out when workers>1.
I hope this gives some ideas to people who have the same problem as me.
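One way to keep a blocking socket.recv from stalling a worker indefinitely is to give the socket a timeout. This is a minimal sketch, not the code from the answer above; recv_with_timeout is a hypothetical helper, and returning None on timeout is just one possible convention.

```python
import socket


def recv_with_timeout(sock, max_bytes, timeout_seconds):
    """Receive up to max_bytes, or return None if nothing arrives in time."""
    # settimeout makes later blocking calls on this socket raise on expiry.
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(max_bytes)
    except socket.timeout:
        return None
```

The caller can then decide whether to retry, log, or skip the step, rather than leaving the worker stuck until gunicorn kills it.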

For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml, and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 in the other.

For me, it was because I forgot to set up a firewall rule on the database server for my Django app.

Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean Postgresql database. All I needed to do was add my droplet to the database's "Trusted Sources".
(click on database in DO console, then click on settings. Edit Trusted Sources and select droplet name (click in editable area and it will be suggested to you)).

Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
E.g. if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out in 1 second, and retry 3 times, this gives 10+1*3 ~ 13 seconds, and you can see that it would trigger some times but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.
You can see that adding more workers may help with (or hide) the problem.
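The rough estimate in this answer can be written down as a small calculation. The sketch below only reproduces the answer's own arithmetic (one check period plus the retried probe timeouts); the function names are hypothetical, and real platforms may compute their failure window differently.

```python
def seconds_until_kill(period, probe_timeout, retries):
    """Rough estimate from the answer: one period plus the retried probe timeouts."""
    return period + probe_timeout * retries


def request_may_trigger_kill(request_seconds, period, probe_timeout, retries):
    """True if a request can block health checks for longer than the estimate."""
    return request_seconds > seconds_until_kill(period, probe_timeout, retries)
```

With the numbers from the example (a 10-second period, a 1-second timeout, and 3 retries), the estimate is 13 seconds, so a 25-second request can outlast it.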

The easiest way that worked for me is to create a new config.py file in the same folder where your app.py exists and to put inside it the timeout and all your desired special configuration:
timeout = 999
Then just run the server while pointing to this configuration file
gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app
Note that for this to work you also need a wsgi.py in the same directory containing the following:
from myproject import app

if __name__ == "__main__":
    app.run()
Cheers!

Apart from the gunicorn timeout settings already suggested, since you are using nginx in front, you can check whether these two parameters help: proxy_connect_timeout and proxy_read_timeout, which default to 60 seconds. You can set them in your nginx configuration file like this:
proxy_connect_timeout 120s;
proxy_read_timeout 120s;

In my case I came across this issue when sending larger (10 MB) files to my server. My development server (app.run()) received them without a problem, but gunicorn could not handle them.
For people who run into the same problem I did, my solution was to send the file in chunks like this:
import requests

# location, url, server_file_name, name and key are defined elsewhere in the original script.
def upload_to_server():
    upload_file_path = location

    def read_in_chunks(file_object, chunk_size=524288):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 512 KiB."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    with open(upload_file_path, 'rb') as f:
        # Each chunk goes out as its own POST; the server appends them in order.
        for piece in read_in_chunks(f):
            r = requests.post(
                url + '/api/set-doc/stream' + '/' + server_file_name,
                files={name: piece},
                headers={'key': key, 'allow_all': 'true'})
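My Flask server:
import os
from flask import request, flash
from markupsafe import escape
from werkzeug.utils import secure_filename

# app, key and allowed_file are defined elsewhere in the original application.
@app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
def api_set_file_streamed(name):
    folder = escape(name)  # secure_filename(escape(name))
    # Flask views cannot return a bare integer, so 404 is returned as (body, status).
    if 'key' in request.headers:
        if request.headers['key'] != key:
            return '', 404
    else:
        return '', 404
    for fn in request.files:
        file = request.files[fn]
        if fn == '':
            print('no file name')
            flash('No selected file')
            return 'fail'
        if file and allowed_file(file.filename):
            file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
            if not os.path.exists(file_dir_path):
                os.makedirs(file_dir_path)
            file_path = os.path.join(file_dir_path, secure_filename(file.filename))
            # Open in append mode so successive chunks extend the same file.
            with open(file_path, 'ab') as f:
                f.write(file.read())
            return 'success'
    return '', 404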

In case you have changed the name of the Django project, you should also go to
cd /etc/systemd/system/
then
sudo nano gunicorn.service
and verify that, at the end of the bind line, the application name has been changed to the new application name.

Monit service name error

So I have the following in my monitrc file:
check process apache with pidfile /usr/local/apache/logs/httpd.pid
group apache
start program = "/etc/init.d/httpd start"
stop program = "/etc/init.d/httpd stop"
if failed host XXX port 80 protocol http
and request "/monit/token" then restart
if cpu is greater than 60% for 2 cycles then alert
if cpu 80% for 5 cycles then restart
if totalmem 500 MB for 5 cycles then restart
if children 250 then restart
if loadavg(5min) greater than 10 for 8 cycles then stop
if 3 restarts within 5 cycles then timeout
but I keep getting the error that:
Error: service name conflict, apache already defined '/usr/local/apache/logs/httpd.pid'

If the hostname of the server is 'apache' then the conflict is with the default rule for monitoring the system load.
Monit seems to have the implicit rule of 'check system hostname', where the hostname is the output of hostname command.
You can override that by adding just a line like:
check system newhostname
For example:
check system localhost

I saw this error when I forgot to comment out the line:
include /etc/monit/conf.d/*
in a custom /etc/monit/conf.d/myprogram.conf file, so it was recursively including that file.

By any chance do you have an entry with a host name apache beneath this entry or in a separate monit config file?

You have the same service defined more than once. Check all your monit config files for that service. This includes your monitrc and all files listed under the "Includes" section (like include /etc/monit/conf.d/*).
If you redefine "Includes" within a file in one of your "Includes" directories, you will run into recursive reference problems.
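Checking every config file by hand gets tedious, so a short script can look for names that are declared more than once. This sketch assumes each service is declared on a line of the form "check <type> <name> ...", as in the monitrc above. find_duplicate_services is a hypothetical helper, it does not handle quoted names, and it cannot see monit's implicit system check on the hostname, because that check is not written in any file.

```python
import re
from collections import defaultdict

# Matches lines such as "check process apache with pidfile ..." or "check system localhost".
CHECK_RE = re.compile(r"^\s*check\s+\w+\s+(\S+)", re.IGNORECASE | re.MULTILINE)


def find_duplicate_services(configs):
    """Map each service name defined more than once to the files defining it.

    configs is a dict of {filename: file contents}.
    """
    seen = defaultdict(list)
    for filename, text in configs.items():
        for name in CHECK_RE.findall(text):
            seen[name].append(filename)
    return {name: files for name, files in seen.items() if len(files) > 1}
```

The check type is deliberately ignored when comparing: as the hostname case shows, a "check system apache" entry conflicts with a "check process apache" entry, so only the name has to repeat.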

Very important: you need monit 5.5.
For example, on Ubuntu 12.04 only 5.3 is available in the repo,
so you need to download and install it from another repo.
The solution for me, for example:
wget http://mirrors.kernel.org/ubuntu/pool/universe/m/monit/monit_5.5.1-1_amd64.deb && sudo dpkg -i monit_5.5.1-1_amd64.deb

For my case, I simply had to restart monit to get rid of the service name error:
sudo service monit restart

Check whether you have any conflicts for Apache defined in any of the monit conf files under the /etc/monit.d/ directory. I accidentally added nginx to my puma.conf and ran into the same error before.
