Why does my openshift app timeout when I try to access the URL? - openshift

I am trying to set up a BrowserQuest server that runs in openshift
I've been following this readme. Everything seems to go fine, I get to the end and run rhc app show bq and get the following output:
bq # http://bq-plantagenet.rhcloud.com/ (uuid: 55e4311189f5cf028d0000fc)
------------------------------------------------------------------------
Domain: plantagenet
Created: 8:18 AM
Gears: 1 (defaults to small)
Git URL: ssh://55e4311189f5cf028d0000fc#bq-plantagenet.rhcloud.com/~/git/bq.git/
SSH: 55e4311189f5cf028d0000fc#bq-plantagenet.rhcloud.com
Deployment: auto (on git push)
nodejs-0.10 (Node.js 0.10)
--------------------------
Gears: Located with smarterclayton-redis-2.6
smarterclayton-redis-2.6 (Redis)
--------------------------------
From: http://cartreflect-claytondev.rhcloud.com/reflect?github=smarterclayton/openshift-redis-cart
Website: https://github.com/smarterclayton/openshift-redis-cart
Gears: Located with nodejs-0.10
But when I try to access http://bq-plantagenet.rhcloud.com:8080/ in a browser, I get:
The connection has timed out
The server at bq-plantagenet.rhcloud.com is taking too long to respond
My questions are what is going wrong and how can I fix it? Many thanks for your consideration in reading through this and any suggestions you might have for resolving it

You need to access http://bq-plantagenet.rhcloud.com, leave off the port 8080, that is the port you listen on internally. You should also try checking your log files (https://developers.openshift.com/en/managing-log-files.html) to see what errors your application is producing.

Related

Deployment "tiller" exceeded its progress deadline

I'm trying to install tiller server to an Openshift project
Helm/tiller version: 2.9.0
My project name: paytiller
At step 3, executing this command (mentioned as per this document - https://www.openshift.com/blog/getting-started-helm-openshift)
oc rollout status deployment tiller
I get this error:
error: deployment "tiller" exceeded its progress deadline
I'm not clear on what's the error message or could find any logs.
Any idea why this error?
If this doesn't work, what are the other suggestions for templating in Openshift?
EDIT
oc get events
Events:
Type Reason Age From Message
---- ------ ---- ---- ---
Warning Failed 14m (x5493 over 21h) kubelet, example.com Error: ImagePullBackOff
Normal Pulling 9m (x255 over 21h) kubelet, example.com pulling image "gcr.io/kubernetes-helm/tiller:v2.9.0"
Normal BackOff 4m (x5537 over 21h) kubelet, example.com Back-off pulling image "gcr.io/kubernetes-helm/tiller:v2.9.0"
Thanks.
The issue was with the permissions on our OpenShift platform. We didn't have access to download from open-source directly.
We tried to add kubernetes-helm as a docker image to our organization repository and then we were able to pull the image to OpenShift project. It is working now. But still, we didn't get any clue of the issue from the logs.
The status ImagePullBackOff tells you that this image gcr.io/kubernetes-helm/tiller:v2.9.0 could not be pulled from the container registry. So your OpenShift node cannot pull that image for some reason. This is often due to network proxies, a non-existing image (not the issue here) or other restrictions in the (corporate) network.
You can use oc describe pod <pod that shows ImagePullBackOff> to find out the more detailed error message that may help you further.
Also, note that the blog post you linked is from 2017, which is very old. Here is a more current version: Build Kubernetes Operators from Helm Charts in 5 steps
.

Hyperledger Composer CLI Ping to a Business Network returns AccessException

Im trying to learn Hyperledger Composer but seems to be a relatively new technology, i mean there are few tutorials and few solutions to a lot of questions, tutorial does not mention possible error case when following the commands and which means there are is also no solution for those errors.
I have joined the composer channel in their community chat, looks like its running in Discord or something, and asked the same question without a response, i have a better experience here in SO.
This is the problem: I have deployed my business network, installed it, started it, created my network admin card and imported it, then to test if everything is ok i have to command composer network ping --card NAME-OF-MY-ADMIN-CARD
And this error comes:
juan#JuanDeDios:~/proyectos/inovacion/a3-poliza-microservice$ composer network ping --card admin#a3-policy-microservice
Error: transaction returned with failure: AccessException: Participant 'org.hyperledger.composer.system.NetworkAdmin#admin' does not have 'READ' access to resource 'org.hyperledger.composer.system.Network#a3-policy-microservice#0.0.1'
Command failed
I think that it has to do something with the permission.acl file, and gave permission to everyone to everything so there would not be any restrictions to anyone, and tryied again, but failed.
So i thought i had to uninstall my business network and create it again, i deleted my .bna and my network.card files also so everything would be created again, but the same error result.
My other attempt was to update the business network, but didn't work, the same error happened and I'm sure i didn't miss any step from the tutorial. I do also followed the playground tutorial. What i have not done its to create another app with the Yeoman but i will do if i don't find a solution to this problem which would not require me to create another app.
This were my steps:
1-. Created my app with Yeoman
yo hyperledger-composer:businessnetwork
2-. Selected Apache-2.0 for my license
3-. Created a3-policy-microservice as the name of the business network
4-. Created org.microservice.policy (Yeah i switched names but Im totally aware)
5-. Generated my app with a template selecting the NO option
6-. Created my assets, participants and transactions
7-. Changed my permission rules to mine
8-. I generated the .bna file
composer archive create -t dir -n .
9-. Then installed my bna file
composer network install --card PeerAdmin#hlfv1 --archiveFile a3-policy-microservice#0.0.1.bna
10-. Then started my network and created my networkadmin card
composer network start --networkName a3-policy-network --networkVersion 0.0.1 --networkAdmin admin --networkAdminEnrollSecret adminpw --card PeerAdmin#hlfv1 --file networkadmin.card
11-. Imported my card
composer card import --file networkadmin.card
12-. Tried to ping my network
composer network ping --card admin#a3-poliza-microservice
And the error happens
Later i tried to create everything again shutting down my fabric and started it again and creating the network from the first step.
My other attempt was to change the permissions and upgrade my bna network, but it failed too. Im running out of options
Hope this description its not too long to ignore it. Thanks in advance
thanks for the question!
First possibility is that your network name is a3-policy-network but you're pinging a network called a3-poliza-microservice - once you do get the correct ACLs in place (currently, that's the error you're trying to resolve).
The procedure for upgrade would normally be the procedure below:
After your step 12 (where you can't ping the business network due to restrictive ACL conditions, assuming you are using the right network name) you would have:
Make the changes to to include your System ACLs this time eg.
/**
* Sample access control list.
*/
rule SystemACL {
description: "System ACL to permit all access"
participant: "org.hyperledger.composer.system.Participant"
operation: ALL
resource: "org.hyperledger.composer.system.**"
action: ALLOW
}
rule NetworkAdminUser {
description: "Grant business network administrators full access to user resources"
participant: "org.hyperledger.composer.system.NetworkAdmin"
operation: ALL
resource: "**"
action: ALLOW
}
rule NetworkAdminSystem {
description: "Grant business network administrators full access to system resources"
participant: "org.hyperledger.composer.system.NetworkAdmin"
operation: ALL
resource: "org.hyperledger.composer.system.**"
action: ALLOW
}
Update the "version" field in your existing package.json in your Business Network project directory (ie need to change it next increment - eg. update the version property from 0.0.1 to 0.0.2.)
From the same directory, run the following command:
composer archive create --sourceType dir --sourceName . -a a3-policy-network#0.0.2.bna
Now install the new business network code firstly:
composer network install --card PeerAdmin#hlfv1 --archiveFile a3-policy-network#0.0.2.bna
Then perform the requisite upgrade step (single '-' for short form of the parameter):
composer network upgrade -c PeerAdmin#hlfv1 -n a3-policy-network -V 0.0.2
After a few seconds, ping the network again to see ACL changes are now in effect:
composer network ping -c a3-policy-network

go-ethereum - geth - puppeth - ethstat remote server : docker: command not found

I'm trying to setup a private ethereum test network using Puppeth (as Péter Szilágyi demoed in Ethereum devcon three 2017). I'm running it on a macbook pro (macOS Sierra).
When I try to setup the ethstat network component I get an "docker configured incorrectly: bash: docker: command not found" error. I have docker running and I can use it fine in the terminal e.g. docker ps.
Here are the steps I took:
What would you like to do? (default = stats)
1. Show network stats
2. Manage existing genesis
3. Track new remote server
4. Deploy network components
> 4
What would you like to deploy? (recommended order)
1. Ethstats - Network monitoring tool
2. Bootnode - Entry point of the network
3. Sealer - Full node minting new blocks
4. Wallet - Browser wallet for quick sends (todo)
5. Faucet - Crypto faucet to give away funds
6. Dashboard - Website listing above web-services
> 1
Which server do you want to interact with?
1. Connect another server
> 1
Please enter remote server's address:
> localhost
DEBUG[11-15|22:46:49] Attempting to establish SSH connection server=localhost
WARN [11-15|22:46:49] Bad SSH key, falling back to passwords path=/Users/xxx/.ssh/id_rsa err="ssh: cannot decode encrypted private keys"
The authenticity of host 'localhost:22 ([::1]:22)' can't be established.
SSH key fingerprint is xxx [MD5]
Are you sure you want to continue connecting (yes/no)? yes
What's the login password for xxx at localhost:22? (won't be echoed)
>
DEBUG[11-15|22:47:11] Verifying if docker is available server=localhost
ERROR[11-15|22:47:11] Server not ready for puppeth err="docker configured incorrectly: bash: docker: command not found\n"
Here are my questions:
Is there any documentation / tutorial describing how to setup this remote server properly. Or just on puppeth in general?
Can I not use localhost as "remote server address"
Any ideas on why the docker command is not found (it is installed and running and I can use it ok in the terminal).
Here is what I did.
For the docker you have to use the docker-compose binary. You can find it here.
Furthermore, you have to be sure that an ssh server is running on your localhost and that keys have been generated.
I didn't find any documentations for puppeth whatsoever.
I think I found the root cause to this problem. The SSH daemon is compiled with a default path. If you ssh to a machine with a specific command (other than a shell), you get that default path. This does not include /usr/local/bin for example, where docker lives in my case.
I found the solution here: https://serverfault.com/a/585075:
edit /etc/ssh/sshd_config and make sure it contains PermitUserEnvironment yes (you need to edit this with sudo)
create a file ~/.ssh/environment with the path that you want, in my case:
PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
When you now run ssh localhost env you should see a PATH that matches whatever you put in ~/.ssh/environment.

Openshift 3 , 503 Error (No server is available to handle this request)

I have created a web application using jsp/tiles/struts/mysql/tomcat. I created new project on Openshift 3 console (Openshift online) https://console.preview.openshift.com/console/ then added tomcat/mySql. I was getting 503 error sometimes and other times, same page was working as expected. 503 error came randomly for any page from my project. When I get 503 error, I refresh some no of times and it goes away, and my page is correctly displayed.
Error that I see is:
"503 Service Unavailable
No server is available to handle this request. "
I did some research:
What I understand from this openshift 2 link:
https://blog.openshift.com/how-to-host-your-java-ee-application-with-auto-scaling/
is that to correct 503 error:
SSH into your application gear using rhc ssh --app <app_name>
Change directory to haproxy/conf
change the following in haproxy.cfg option httpchk GET / to option httpchk GET /api/v1/ping
Restart the HAProxy cartridge from your local machine using RHC rhc cartridge-restart --cartridge haproxy
I dont know if it is also applicable to openshift 3. In openshift 3 where is haproxy.log, haproxy.cfg, haproxy/conf or its slightly different in openshift 3. (Nut thanks to Warrens comments, yes he saw 503 error in openshift related to HAProxy)
Now after 1 week after posting this question:
I am getting Quota Reached Error. I am able to build my project but all deployments are failing. I wonder if 503 error that I was getting earlier(either completely or partially) was related to Quota reached. How should I proceed now.
curl -i localhost:8080/GEA
HTTP/1.1 302 Found Server:
Apache-Coyote/1.1
Location: http://localhost:8080/GEA/
Transfer-Encoding: chunked Date: Tue, 11 Apr 2017 18:03:25 GMT
Tomcat logs do not show any application error.
Will Readiness Probe and Liveness Probe help me? I have not set them yet.
Nor do I know how to set them.
Will scaling help me (I dont know how to set it either)
Do I have to set memory/... all at maximum allowed to ensure project runs smooth?
For me I had a similar situation of getting 503's sometimes and sometimes getting my actual page. the reason was because you have haproxy on the frontend handling the requests. Depending on your setup you may even have a few haproxy pods and your request could be funneled between one of the pods. So as in my case one pod was working and the other not.
So basically
oc get pods -n default
NAME READY STATUS RESTARTS AGE
docker-registry-7-i02rh 1/1 Running 0 75d
registry-console-12-wciib 1/1 Running 0 67d
router-1-533cg 1/1 Running 3 76d
router-1-9utld 1/1 Running 1 76d
router-1-uwf64 1/1 Running 1 76d
As you can see in my output default namespace is where my router(haproxy) pods live. If I change to that namespace
oc project default
Then run
oc logs -f router-1-533cg
on each of the pods you will most likely find a sepcific pod that is behaving bad. You can simply delete, and the replication controller will create a new one

Frequent worker timeout

I have setup gunicorn with 3 workers, 30 worker connections and using eventlet worker class. It is set up behind Nginx. After every few requests, I see this in the logs.
[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514
Why is this happening? How can I figure out what's going wrong?
We had the same problem using Django+nginx+gunicorn. From Gunicorn documentation we have configured the graceful-timeout that made almost no difference.
After some testings, we found the solution, the parameter to configure is: timeout (And not graceful timeout). It works like a clock..
So, Do:
1) open the gunicorn configuration file
2) set the TIMEOUT to what ever you need - the value is in seconds
NUM_WORKERS=3
TIMEOUT=120
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--timeout $TIMEOUT \
--log-level=debug \
--bind=127.0.0.1:9000 \
--pid=$PIDFILE
On Google Cloud
Just add --timeout 90 to entrypoint in app.yaml
entrypoint: gunicorn -b :$PORT main:app --timeout 90
Run Gunicorn with --log-level debug.
It should give you an app stack trace.
Is this endpoint taking too many time?
Maybe you are using flask without asynchronous support, so every request will block the call. To create async support without make difficult, add the gevent worker.
With gevent, a new call will spawn a new thread, and you app will be able to receive more requests
pip install gevent
gunicon .... --worker-class gevent
The Microsoft Azure official documentation for running Flask Apps on Azure App Services (Linux App) states the use of timeout as 600
gunicorn --bind=0.0.0.0 --timeout 600 application:app
https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app
WORKER TIMEOUT means your application cannot response to the request in a defined amount of time. You can set this using gunicorn timeout settings. Some application need more time to response than another.
Another thing that may affect this is choosing the worker type
The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.
When I got the same problem as yours (I was trying to deploy my application using Docker Swarm), I've tried to increase the timeout and using another type of worker class. But all failed.
And then I suddenly realised I was limitting my resource too low for the service inside my compose file. This is the thing slowed down the application in my case
deploy:
replicas: 5
resources:
limits:
cpus: "0.1"
memory: 50M
restart_policy:
condition: on-failure
So I suggest you to check what thing slowing down your application in the first place
Could it be this?
http://docs.gunicorn.org/en/latest/settings.html#timeout
Other possibilities could be your response is taking too long or is stuck waiting.
This worked for me:
gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000
If you have eventlet add:
--worker-class=eventlet
If you have gevent add:
--worker-class=gevent
I've got the same problem in Docker.
In Docker I keep trained LightGBM model + Flask serving requests. As HTTP server I used gunicorn 19.9.0. When I run my code locally on my Mac laptop everything worked just perfect, but when I ran the app in Docker my POST JSON requests were freezing for some time, then gunicorn worker had been failing with [CRITICAL] WORKER TIMEOUT exception.
I tried tons of different approaches, but the only one solved my issue was adding worker_class=gthread.
Here is my complete config:
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3
I had very similar problem, I also tried using "runserver" to see if I could find anything but all I had was a message Killed
So I thought it could be resource problem, and I went ahead to give more RAM to the instance, and it worked.
You need to used an other worker type class an async one like gevent or tornado see this for more explanation :
First explantion :
You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing
Second one :
The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.
If you are using GCP then you have to set workers per instance type.
Link to GCP best practices https://cloud.google.com/appengine/docs/standard/python3/runtime
timeout is a key parameter to this problem.
however it's not suit for me.
i found there is not gunicorn timeout error when i set workers=1.
when i look though my code, i found some socket connect (socket.send & socket.recv) in server init.
socket.recv will block my code and that's why it always timeout when workers>1
hope to give some ideas to the people who have some problem with me
For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml, and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 in the other.
For me, it was because I forgot to setup firewall rule on database server for my Django.
Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean Postgresql database. All I needed to do was add my droplet to the database's "Trusted Sources".
(click on database in DO console, then click on settings. Edit Trusted Sources and select droplet name (click in editable area and it will be suggested to you)).
Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
E.g. if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out in 1 second, and retry 3 times, this gives 10+1*3 ~ 13 seconds, and you can see that it would trigger some times but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.
You can see that adding more workers may help with (or hide) the problem.
The easiest way that worked for me is to create a new config.py file in the same folder where your app.py exists and to put inside it the timeout and all your desired special configuration:
timeout = 999
Then just run the server while pointing to this configuration file
gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app
note that for this statement to work you need wsgi.py also in the same directory having the following
from myproject import app
if __name__ == "__main__":
app.run()
Cheers!
Apart from the gunicorn timeout settings which are already suggested, since you are using nginx in front, you can check if these 2 parameters works, proxy_connect_timeout and proxy_read_timeout which are by default 60 seconds. Can set them like this in your nginx configuration file as,
proxy_connect_timeout 120s;
proxy_read_timeout 120s;
In my case I came across this issue when sending larger(10MB) files to my server. My development server(app.run()) received them no problem but gunicorn could not handle them.
for people who come to the same problem I did. My solution was to send it in chunks like this:
ref / html example, separate large files ref
def upload_to_server():
upload_file_path = location
def read_in_chunks(file_object, chunk_size=524288):
"""Lazy function (generator) to read a file piece by piece.
Default chunk size: 1k."""
while True:
data = file_object.read(chunk_size)
if not data:
break
yield data
with open(upload_file_path, 'rb') as f:
for piece in read_in_chunks(f):
r = requests.post(
url + '/api/set-doc/stream' + '/' + server_file_name,
files={name: piece},
headers={'key': key, 'allow_all': 'true'})
my flask server:
#app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
def api_set_file_streamed(name):
folder = escape(name) # secure_filename(escape(name))
if 'key' in request.headers:
if request.headers['key'] != key:
return 404
else:
return 404
for fn in request.files:
file = request.files[fn]
if fn == '':
print('no file name')
flash('No selected file')
return 'fail'
if file and allowed_file(file.filename):
file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
if not os.path.exists(file_dir_path):
os.makedirs(file_dir_path)
file_path = os.path.join(file_dir_path, secure_filename(file.filename))
with open(file_path, 'ab') as f:
f.write(file.read())
return 'sucess'
return 404
in case you have changed the name of the django project you should also go to
cd /etc/systemd/system/
then
sudo nano gunicorn.service
then verify that at the end of the bind line the application name has been changed to the new application name