Hot reconfiguration of HAProxy still leads to failed requests, any suggestions?

I found that some requests still fail under high traffic when I use a command like this
haproxy -f /etc/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
to hot reload the updated config file.
Below is the pressure testing result using webbench:
/usr/local/bin/webbench -c 10 -t 30 targetHProxyIP:1080
Webbench – Simple Web Benchmark 1.5
Copyright (c) Radim Kolar 1997-2004, GPL Open Source Software.
Benchmarking: GET targetHProxyIP:1080
10 clients, running 30 sec.
Speed=70586 pages/min, 13372974 bytes/sec.
**Requests: 35289 susceed, 4 failed.**
I ran the command
haproxy -f /etc/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
several times during the pressure testing.
In the haproxy documentation, it mentioned
They will receive the SIGTTOU signal to ask them to temporarily stop listening to the ports so that the new process can grab them
So there is a time period in which the old process is no longer listening on the port (say 80) and the new process hasn't started listening on it yet, and NEW connections arriving during that period will fail. Does that make sense?
Is there any way to reload the haproxy configuration that impacts neither existing connections nor new connections?

On recent kernels where SO_REUSEPORT is finally implemented (3.9+), this dead period does not exist anymore. While a patch has been available for older kernels for something like 10 years, it's obvious that many users cannot patch their kernels. If your system is more recent, then the new process will succeed in its attempt to bind() before asking the previous one to release the port, so there's a period where both processes are bound to the port instead of no process.
There is still a very tiny possibility that a connection arrived in the leaving process' queue at the moment it closes it. There is no reliable way to stop this from happening though.
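The overlap described above can be reproduced in a few lines. This is only a minimal sketch, not how HAProxy itself is implemented: it assumes Linux 3.9+ (where Python exposes socket.SO_REUSEPORT), and the helper name bind_reuseport is made up for illustration. It binds an "old" and a "new" listener to the same port at the same time, which is the window that lets the new process take over without a gap.

```python
import socket


def bind_reuseport(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Return a listening TCP socket bound with SO_REUSEPORT (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Both the old and the new listener must set the option before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(16)
    return sock


# "Old process": let the kernel pick a free port.
old_listener = bind_reuseport(0)
port = old_listener.getsockname()[1]

# "New process": binds the very same port while the old listener is still open.
new_listener = bind_reuseport(port)
both_bound = new_listener.getsockname()[1] == port

# Only now does the old listener release the port.
old_listener.close()
new_listener.close()
```

Without the socket option, the second bind() would fail with "Address already in use", which is why older kernels need the stop-listening handshake and its short dead period.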

Related

fatal: the remote hung up unexpectedly. Can’t push things to my git repo

My first problem looked like this:
Writing objects: 60% (9/15)
It froze there for some time with a very low upload speed (in kb/s), then, after a long time, gave this message:
fatal: the remote end hung up unexpectedly
Everything up-to-date
I found something that seemed to be a solution:
git config http.postBuffer 524288000
This created a new problem that looks like this:
MacBook-Pro-Liana:LC | myWebsite Liana$ git config http.postBuffer 524288000
MacBook-Pro-Liana:LC | myWebsite Liana$ git push -u origin master
Enumerating objects: 15, done.
Counting objects: 100% (15/15), done.
Delta compression using up to 4 threads
Compressing objects: 100% (14/14), done.
Writing objects: 100% (15/15), 116.01 MiB | 25.16 MiB/s, done.
Total 15 (delta 2), reused 0 (delta 0)
error: RPC failed; curl 56 LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 54
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
Everything up-to-date
Please help, I have no idea what’s going on...
First, Git 2.25.1 made it clear that:
Users in a wide variety of situations find themselves with HTTP push problems.
Oftentimes these issues are due to antivirus software, filtering proxies, or other man-in-the-middle situations; other times, they are due to simple unreliability of the network.
This works for none of the aforementioned situations and is only useful in a small, highly restricted number of cases: essentially, when the connection does not properly support HTTP/1.1.
Raising this is not, in general, an effective solution for most push problems, but can increase memory consumption significantly since the entire buffer is allocated even for small pushes.
Second, it depends on your actual remote (GitHub? GitLab? BitBucket? On-premise server). Said remote server might have an incident in progress.

DIY cartridge stops on git push

I've been developing an application for some weeks, and it's been running in an OpenShift small gear with DIY 0.1 + PostgreSQL cartridges for several days, including ~5 new deployments. Everything was ok and a new deploy stopped and started everything in seconds.
Nevertheless today pushing master as usual stops the cartridge and it won't start. This is the trace:
Counting objects: 2688, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (1930/1930), done.
Writing objects: 100% (2080/2080), 10.76 MiB | 99 KiB/s, done.
Total 2080 (delta 1300), reused 13 (delta 0)
remote: Stopping DIY cartridge
fatal: The remote end hung up unexpectedly
fatal: The remote end hung up unexpectedly
Logging in with ssh and running the start action hook manually fails because database is stopped. Restarting the gear makes everything work again.
The failing deployment has nothing to do with it, since it only adds a few lines of code, nothing about configuration or anything that might break the boot.
Logs (at $OPENSHIFT_LOG_DIR) reveal nothing. Quota usage seems fine:
Cartridges Used Limit
---------------------- ------ -----
diy-0.1 postgresql-9.2 0.6 GB 1 GB
Any suggestions about what I could check?
Oh, dumb mistake. My last working deployment involved a change in the binary name, which now matches the gear name. The stop script, which relied on ps and grep, was wrong: it killed not only the application but also the connection. Changing it fixed the issue.
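To illustrate the kind of mistake described here, the following is a rough sketch of a stricter match than a substring grep. It is not the author's actual stop script: the function name, the sample process list and the binary name myapp are all hypothetical, and the input is assumed to look like the output of `ps -eo pid,args`. The idea is to compare the executable's base name exactly, so that other processes that merely mention the same name (such as the SSH session of a user with that name) are left alone.

```python
import os


def pids_for_binary(ps_lines, binary_name):
    """Return PIDs whose executable base name equals binary_name exactly.

    ps_lines: lines shaped like '<pid> <command line>' (assumed format).
    """
    pids = []
    for line in ps_lines:
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip headers and malformed lines
        if os.path.basename(parts[1]) == binary_name:
            pids.append(int(parts[0]))
    return pids


# Hypothetical process list: a substring grep for "myapp" would match all three.
sample = [
    "  PID COMMAND",
    "101 /var/lib/openshift/example/app-root/repo/myapp --port 8080",
    "102 sshd: myapp@notty",
    "103 grep myapp",
]
matched = pids_for_binary(sample, "myapp")
```

Here only PID 101 is selected, whereas a plain grep on the name would also pick up the SSH connection and the grep itself.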
Solution inspired by this blogpost.

Multiple HAProxy instances on OpenShift

I have an application (Node.JS) deployed on OpenShift (bronze plan) with the Web Load Balancer activated, the minimum gears active are 3 and the max are 16.
Sometimes in the main gear I can see more than one HAProxy instance running, for example now I have:
> ps -ef|grep /usr/sbin/haproxy
3505 37488 1 1 08:46 ? 00:00:01 /usr/sbin/haproxy -f /var/lib/openshift/<APP_ID>/haproxy//conf/haproxy.cfg -sf 37237
3505 149643 1 1 May28 ? 00:09:08 /usr/sbin/haproxy -f /var/lib/openshift/<APP_ID>/haproxy//conf/haproxy.cfg -sf 114873
Looking at the logs, I can't see any error. Is there any explanation for this?
Thanks!
This could be a consequence of executing the HAProxy reload script (/etc/init.d/haproxy). This will usually create a new haproxy process to accept new connections. It will also keep the old process alive as long as there are still open connections to it. Once they are closed, the old haproxy process will be terminated.

Frequent worker timeout

I have set up gunicorn with 3 workers, 30 worker connections and the eventlet worker class. It is set up behind Nginx. After every few requests, I see this in the logs.
[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514
Why is this happening? How can I figure out what's going wrong?
We had the same problem using Django+nginx+gunicorn. Following the Gunicorn documentation, we configured graceful-timeout, but it made almost no difference.
After some testing, we found the solution: the parameter to configure is timeout (and not graceful-timeout). It works like clockwork.
So, do this:
1) open the gunicorn configuration file
2) set TIMEOUT to whatever you need - the value is in seconds
NUM_WORKERS=3
TIMEOUT=120
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--timeout $TIMEOUT \
--log-level=debug \
--bind=127.0.0.1:9000 \
--pid=$PIDFILE
On Google Cloud
Just add --timeout 90 to entrypoint in app.yaml
entrypoint: gunicorn -b :$PORT main:app --timeout 90
Run Gunicorn with --log-level debug.
It should give you an app stack trace.
Is this endpoint taking too much time?
Maybe you are using Flask without asynchronous support, so every request blocks the worker. To add async support without much difficulty, use the gevent worker.
With gevent, each new request is handled in its own greenlet (a lightweight cooperative thread, not an OS thread), so your app will be able to accept more requests concurrently
pip install gevent
gunicorn .... --worker-class gevent
The official Microsoft Azure documentation for running Flask apps on Azure App Service (Linux) uses a timeout of 600 in its example
gunicorn --bind=0.0.0.0 --timeout 600 application:app
https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app
WORKER TIMEOUT means your application cannot respond to the request within a defined amount of time. You can set this using the gunicorn timeout setting. Some applications need more time to respond than others.
Another thing that may affect this is choosing the worker type
The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.
When I had the same problem as you (I was trying to deploy my application using Docker Swarm), I tried increasing the timeout and using another worker class. Both failed.
Then I suddenly realised that the resource limits for the service in my compose file were too low. This is what slowed down the application in my case
deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure
So I suggest you first check what is slowing down your application
Could it be this?
http://docs.gunicorn.org/en/latest/settings.html#timeout
Other possibilities could be your response is taking too long or is stuck waiting.
This worked for me:
gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000
If you have eventlet add:
--worker-class=eventlet
If you have gevent add:
--worker-class=gevent
I've got the same problem in Docker.
In Docker I keep trained LightGBM model + Flask serving requests. As HTTP server I used gunicorn 19.9.0. When I run my code locally on my Mac laptop everything worked just perfect, but when I ran the app in Docker my POST JSON requests were freezing for some time, then gunicorn worker had been failing with [CRITICAL] WORKER TIMEOUT exception.
I tried tons of different approaches, but the only one that solved my issue was adding worker_class=gthread.
Here is my complete config:
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3
I had a very similar problem. I also tried using "runserver" to see if I could find anything, but all I got was the message Killed
So I thought it could be a resource problem, gave the instance more RAM, and it worked.
You need to use another worker class, an async one like gevent or tornado. See these for more explanation:
First explanation:
You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing
Second one :
The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.
If you are using GCP then you have to set workers per instance type.
Link to GCP best practices https://cloud.google.com/appengine/docs/standard/python3/runtime
timeout is a key parameter for this problem.
However, it was not the fix for me.
I found that there was no gunicorn timeout error when I set workers=1.
When I looked through my code, I found some socket calls (socket.send & socket.recv) in the server init.
socket.recv was blocking my code, and that's why it always timed out when workers>1
I hope this gives some ideas to people who have the same problem as me
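One way to keep a blocking socket.recv in startup code from hanging a worker until gunicorn kills it is to put a deadline on the socket. The snippet below is a minimal sketch under assumptions: the function name recv_or_none and the default values are illustrative and not taken from the answer. It relies on the fact that the timeout passed to socket.create_connection also applies to later recv calls on that socket.

```python
import socket


def recv_or_none(host, port, timeout_s=2.0, nbytes=1024):
    """Connect and read once; return None instead of blocking forever.

    Hypothetical helper: the timeout covers both the connect and the recv.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            return conn.recv(nbytes)
    except OSError:
        # socket.timeout and connection errors are all OSError subclasses.
        return None
```

Called during server init, something like recv_or_none("127.0.0.1", 9000) would return after at most a few seconds even if the peer never sends anything, so the worker can still answer requests (the address here is just a placeholder).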
For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml, and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 in the other.
For me, it was because I forgot to setup firewall rule on database server for my Django.
Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean Postgresql database. All I needed to do was add my droplet to the database's "Trusted Sources".
(click on database in DO console, then click on settings. Edit Trusted Sources and select droplet name (click in editable area and it will be suggested to you)).
Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
E.g. if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out in 1 second, and retry 3 times, this gives 10+1*3 ~ 13 seconds, and you can see that it would trigger some times but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.
You can see that adding more workers may help with (or hide) the problem.
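The timing estimate from the 25-second example can be written out as a small calculation. This is only a sketch of the answer's own back-of-the-envelope formula (one probe period plus the time spent on the failed attempts); how a real platform spaces its retries may differ, and the function names are made up for illustration.

```python
def estimated_kill_window_s(period_s, probe_timeout_s, retries):
    """Rough time a worker can stay blocked before the platform kills it.

    Follows the answer's estimate: period + timeout * retries.
    """
    return period_s + probe_timeout_s * retries


def request_may_trigger_kill(request_s, period_s, probe_timeout_s, retries):
    """True if a request of request_s seconds outlasts the estimated window."""
    return request_s > estimated_kill_window_s(period_s, probe_timeout_s, retries)


# Values from the example: probe every 10 s, 1 s timeout, 3 retries, 25 s request.
window = estimated_kill_window_s(10, 1, 3)
at_risk = request_may_trigger_kill(25, 10, 1, 3)
```

With these numbers the window is 13 seconds, so a 25-second request is at risk while a 5-second one is not.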
The easiest way that worked for me is to create a new config.py file in the same folder where your app.py exists and to put inside it the timeout and all your desired special configuration:
timeout = 999
Then just run the server while pointing to this configuration file
gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app
Note that for this command to work you also need a wsgi.py in the same directory, containing the following
from myproject import app
if __name__ == "__main__":
    app.run()
Cheers!
Apart from the gunicorn timeout settings which were already suggested, since you are using nginx in front, you can check whether these 2 parameters help: proxy_connect_timeout and proxy_read_timeout, which are 60 seconds by default. You can set them like this in your nginx configuration file:
proxy_connect_timeout 120s;
proxy_read_timeout 120s;
In my case I came across this issue when sending larger (10MB) files to my server. My development server (app.run()) received them with no problem, but gunicorn could not handle them.
For people who run into the same problem I did: my solution was to send the file in chunks like this:
ref / html example, separate large files ref
import requests

# location, url, server_file_name, name and key are defined elsewhere in the client.
def upload_to_server():
    upload_file_path = location

    def read_in_chunks(file_object, chunk_size=524288):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 512 KiB."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    # Each chunk is sent as its own POST; the server appends it to the target file.
    with open(upload_file_path, 'rb') as f:
        for piece in read_in_chunks(f):
            r = requests.post(
                url + '/api/set-doc/stream' + '/' + server_file_name,
                files={name: piece},
                headers={'key': key, 'allow_all': 'true'})
my flask server:
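import os
from flask import flash, request
from markupsafe import escape
from werkzeug.utils import secure_filename

# app, key and allowed_file are defined elsewhere in the server module.
@app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
def api_set_file_streamed(name):
    folder = escape(name)  # secure_filename(escape(name))
    # Reject requests that do not carry the expected key header.
    if 'key' in request.headers:
        if request.headers['key'] != key:
            return '', 404
    else:
        return '', 404
    for fn in request.files:
        file = request.files[fn]
        if fn == '':
            print('no file name')
            flash('No selected file')
            return 'fail'
        if file and allowed_file(file.filename):
            file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
            if not os.path.exists(file_dir_path):
                os.makedirs(file_dir_path)
            file_path = os.path.join(file_dir_path, secure_filename(file.filename))
            # Append mode: every chunk is added to the end of the same file.
            with open(file_path, 'ab') as f:
                f.write(file.read())
            return 'success'
    # Flask cannot return a bare integer, so send an empty body with the status code.
    return '', 404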
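The chunk-and-append idea above can be checked locally without any HTTP involved. The following is a minimal sketch using only the standard library: read_in_chunks mirrors the generator from the answer, while copy_in_chunks, the temporary file names and the payload size are assumptions made for the demonstration. Each chunk is appended with mode 'ab', the same way the server side writes it.

```python
import os
import tempfile


def read_in_chunks(file_object, chunk_size=524288):
    """Yield successive chunks of at most chunk_size bytes (512 KiB by default)."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def copy_in_chunks(src_path, dst_path, chunk_size=524288):
    """Append every chunk of src_path to dst_path; return the number of chunks."""
    count = 0
    with open(src_path, "rb") as src:
        for piece in read_in_chunks(src, chunk_size):
            # Re-open in append mode per chunk, like one request per chunk.
            with open(dst_path, "ab") as dst:
                dst.write(piece)
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "source.bin")
    dst_path = os.path.join(tmp, "received.bin")
    payload = os.urandom(1_300_000)  # a bit more than two full chunks
    with open(src_path, "wb") as f:
        f.write(payload)
    chunks_sent = copy_in_chunks(src_path, dst_path)
    with open(dst_path, "rb") as f:
        round_trip_ok = f.read() == payload
```

One consequence of append mode is worth keeping in mind: uploading a file with the same name twice concatenates the two uploads, so a real server would need to remove or truncate the target before the first chunk arrives.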
In case you have changed the name of the Django project, you should also go to
cd /etc/systemd/system/
then
sudo nano gunicorn.service
then verify that the application name at the end of the bind line has been changed to the new application name

How do I PSRemote start procdump such that it persists after the session ends

I can start a persistent process on unix with:
nohup process &
It will continue to run after I close my bash session. I cannot seem to do the same with PowerShell remoting on Windows. I can open a PSRemote session with a server and start a process, but as soon as I close that session it dies. My assumption is this is a benefit of strong sandboxing, but it's a benefit I'd rather work around somehow. Any ideas?
So far I've tried:
$exe ='d:\procdump.exe'
$processArgs = '-ma -e -t -n 3 -accepteula w3wp.exe d:\Dumps'
1) [System.Diagnostics.Process]::Start($exe,$processArgs)
2) Start-Job -ScriptBlock {param($exe,$processArgs) [System.Diagnostics.Process]::Start($exe,$processArgs)} -ArgumentList ($exe,$processArgs)
3) start powershell {param($exe ='d:\procdump.exe', $processArgs = '-ma -e -t -n 3 -accepteula w3wp.exe d:\Dumps') [System.Diagnostics.Process]::Start($exe,$processArgs)}
4) start powershell {param($exe ='d:\procdump.exe', $processArgs = '-ma -e -t -n 3 -accepteula w3wp.exe d:\Dumps') Start-Job -ScriptBlock {param($exe,$processArgs) [System.Diagnostics.Process]::Start($exe,$processArgs)} -ArgumentList ($exe,$processArgs)}
The program runs up until I close the session, then the procdump is reaped. The coolest thing about procdump is it will self-terminate, and I'd like to leave it running to take advantage of that fact.
I'd been starting ADPlus remotely, holding a session open, and just terminating the session to kill the captures. That's kind of handy, but it requires an awful lot of polling, inspecting, and deciding when is the right moment to kill the capture process before filling up the hard drive but after capturing enough dumps to be useful. I can leave procdump running indefinitely while it waits for an appropriate trigger and when it's captured enough data it will just die. That's lovely.
I just need to get procdump to keep running after I terminate my remote session. It's probably not worth creating a procdump scheduled task and starting it, but that's about the last idea I've got left.
Thanks.
This is not directly possible. Indirectly, yes a task or a service could be created and started remotely, but simply pushing a process off into the SYSTEM space is not.
I resolved my issue by spawning a local job to start the remote job and remain alive for the required period of time. The local job holds the remote session open and then dies at the appropriate time, while the parent local process is able to continue to run uninterrupted and harvest the return value of the remote procdump with Receive-Job if I happen to care.