I am trying to configure Envoy as a load balancer, and I am currently stuck on fallbacks. In my playground cluster I have 3 backend servers and Envoy as a front proxy. I generate some traffic against Envoy using siege and watch the responses. While doing this I stop one of the backends.
What I want: Envoy should resend requests that failed on the stopped backend to a healthy one, so I never get any 5xx responses.
What I get: when stopping a backend I get some 503 responses, and then everything becomes normal again.
What am I doing wrong? I thought fallback_policy should provide this functionality, but it doesn't work.
Here is my config file:
node:
  id: LoadBalancer_01
  cluster: HighloadCluster
admin:
  access_log_path: /var/log/envoy/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
  - name: http_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: request_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              require_tls: NONE
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: backend_service
                  timeout: 1.5s
                  retry_policy:
                    retry_on: 5xx
                    num_retries: 3
                    per_try_timeout: 3s
          http_filters:
          - name: envoy.router
            typed_config:
              "@type": type.googleapis.com/envoy.config.filter.http.router.v2.Router
          access_log:
          - name: envoy.file_access_log
            typed_config:
              "@type": type.googleapis.com/envoy.config.accesslog.v2.FileAccessLog
              path: /var/log/envoy/access.log
  clusters:
  - name: backend_service
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    lb_subset_config:
      fallback_policy: ANY_ENDPOINT
    outlier_detection:
      consecutive_5xx: 1
      interval: 10s
    load_assignment:
      cluster_name: backend_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 1.1.1.1
                port_value: 10000
        - endpoint:
            address:
              socket_address:
                address: 2.2.2.2
                port_value: 10000
        - endpoint:
            address:
              socket_address:
                address: 3.3.3.3
                port_value: 10000
    health_checks:
    - http_health_check:
        path: /api/liveness-probe
      timeout: 1s
      interval: 30s
      unhealthy_interval: 10s
      unhealthy_threshold: 2
      healthy_threshold: 1
      always_log_health_check_failures: true
      event_log_path: /var/log/envoy/health_check.log
TL;DR
You can use a circuit breaker (see the config example below) alongside your retry_policy and outlier_detection.
Explanation
Context
I have successfully reproduced your issue with your config (except the health_checks part, which I found was not needed to reproduce the problem).
I ran Envoy and my backend (2 apps, load-balanced) and generated some traffic with hey (50 workers making requests concurrently for 10 seconds):
hey -c 50 -z 10s http://envoy:8080
And I stopped one backend app around 5s after the command started.
Result
Digging into the Envoy admin /stats endpoint, I noticed some interesting stuff:
cluster.backend_service.upstream_rq_200: 17899
cluster.backend_service.upstream_rq_503: 28
cluster.backend_service.upstream_rq_retry_overflow: 28
cluster.backend_service.upstream_rq_retry_success: 3
cluster.backend_service.upstream_rq_total: 17930
There were indeed 28 503 responses when I stopped one backend app. But retries did work to some extent: 3 retries succeeded (upstream_rq_retry_success), while 28 others were never attempted (upstream_rq_retry_overflow), resulting in 503 errors. Why?
From the cluster stats docs:
upstream_rq_retry_overflow : Total requests not retried due to circuit breaking or exceeding the retry budget
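As a side note, these counters come from the Envoy admin interface; with the admin listener from the question's config (port 9901), something like this should show them:
curl -s http://localhost:9901/stats | grep 'cluster.backend_service.upstream_rq'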
Fix
To solve this, we can add a circuit breaker to the cluster (I have been generous with the max_requests, max_pending_requests and max_retries parameters for the example). The interesting part is the retry_budget.budget_percent value:
clusters:
- name: backend_service
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  outlier_detection:
    consecutive_5xx: 1
    interval: 10s
  circuit_breakers:
    thresholds:
    - priority: "DEFAULT"
      max_requests: 0xffffffff
      max_pending_requests: 0xffffffff
      max_retries: 0xffffffff
      retry_budget:
        budget_percent:
          value: 100.0
From the retry_budget docs:
budget_percent: Specifies the limit on concurrent retries as a percentage of the sum of active requests and active pending requests. For example, if there are 100 active requests and the budget_percent is set to 25, there may be 25 active retries.
This parameter is optional. Defaults to 20%.
I set it to 100.0 to allow the number of concurrent retries to reach 100% of the active requests.
When running the example again with this new config, there is no more upstream_rq_retry_overflow, so no more 503 errors:
cluster.backend_service.upstream_rq_200: 17051
cluster.backend_service.upstream_rq_retry_overflow: 0
cluster.backend_service.upstream_rq_retry_success: 5
cluster.backend_service.upstream_rq_total: 17056
Note that if you experience upstream_rq_retry_limit_exceeded, you can try setting and increasing retry_budget.min_retry_concurrency (the default when not set is 3):
retry_budget:
  budget_percent:
    value: 100.0
  min_retry_concurrency: 10
Any help is appreciated. Please let me know where I am going wrong!
I am running Loki and Grafana as two different AWS ECS Fargate tasks, but my Loki container keeps failing and restarting itself.
My loki-config.yaml:
auth_enabled: true
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled
schema_config:
  configs:
  - from: 2020-10-24
    store: boltdb-shipper
    object_store: aws
    schema: v11
    index:
      prefix: index_
      period: 24h
storage_config:
  aws:
    s3: s3://XXXXX:YYYY@eu-west-1/logs-loki-test
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h  # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: s3
compactor:
  working_directory: /loki/boltdb-shipper-compactor
  shared_store: aws
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  alertmanager_url: http://localhost:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
In the compactor block, replace shared_store: aws with s3 and try again.
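With the paths from your config, the corrected block would look like this:
compactor:
  working_directory: /loki/boltdb-shipper-compactor
  shared_store: s3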
I was trying to deploy an ERC-721 token using Truffle on Polygon's Mumbai testnet.
I have 2.8296 MATIC in my MetaMask wallet, which I got from their faucet. But when I try to run
truffle migrate --network matic
it says
Error: *** Deployment Failed ***
"Migrations" could not deploy due to insufficient funds
* Account: 0x12aADAdd301d22c941DACF2cfa7A9e2019972F61
* Balance: 0 wei
* Message: insufficient funds for gas * price + value
* Try:
+ Using an adequately funded account
+ If you are using a local Geth node, verify that your node is synced.
Am I doing something wrong? What should the gas and gas price be in the Truffle config file?
Here is my truffle-config file:
const HDWalletProvider = require("@truffle/hdwallet-provider"); // assumed import; the config below references HDWalletProvider

module.exports = {
  networks: {
    development: {
      host: "127.0.0.1",  // Localhost (default: none)
      port: 8545,         // Standard Ethereum port (default: none)
      network_id: "*",    // Any network (default: none)
    },
    matic: {
      provider: () => new HDWalletProvider(process.env.MNEMONIC, `https://rpc-mumbai.maticvigil.com/v1/91fdbb5c2f37c699621ss7d2b8b127fc1a123060`),
      network_id: 80001,
      confirmations: 2,
      timeoutBlocks: 200,
      skipDryRun: true
    },
  },

  // Set default mocha options here, use special reporters etc.
  mocha: {
    // timeout: 100000
  },
};
I am pretty sure there is no balance for this account 0x12aADAdd301d22c941DACF2cfa7A9e2019972F61 on the Mumbai testnet.
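One quick way to check is the Truffle console (a sketch, assuming the same truffle-config and network definition): list the account Truffle derives from your MNEMONIC and query the balance directly.
npx truffle console --network matic
truffle(matic)> (await web3.eth.getAccounts())[0]
truffle(matic)> await web3.eth.getBalance("0x12aADAdd301d22c941DACF2cfa7A9e2019972F61")
If the first command prints a different address than the one you funded from the faucet, the mnemonic is deriving a different account.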
Summary:
I already have a "static Jenkins server"-type Jenkins X setup running on an OpenShift 3.11 provider. The cluster crashed, and I want to reinstall Jenkins X in my cluster, but there is no support for the "static Jenkins server" type anymore.
So I am trying to install Jenkins X via jx boot, but the installation fails with the tekton-pipelines-controller pod going into CrashLoopBackOff state.
Steps to reproduce the behavior:
jx-requirements.yml:
autoUpdate:
  enabled: false
  schedule: ""
bootConfigURL: https://github.com/jenkins-x/jenkins-x-boot-config.git
cluster:
  clusterName: cic-60
  devEnvApprovers:
  - automation
  environmentGitOwner: cic-60
  gitKind: bitbucketserver
  gitName: bs
  gitServer: http://rtx-swtl-git.fnc.net.local
  namespace: jx
  provider: openshift
  registry: docker-registry.default.svc:5000
environments:
- ingress:
    domain: 172.29.35.81.nip.io
    externalDNS: false
    namespaceSubDomain: -jx.
    tls:
      email: ""
      enabled: false
      production: false
  key: dev
  repository: environment-cic-60-dev
- ingress:
    domain: ""
    externalDNS: false
    namespaceSubDomain: ""
    tls:
      email: ""
      enabled: false
      production: false
  key: staging
  repository: environment-cic-60-staging
- ingress:
    domain: ""
    externalDNS: false
    namespaceSubDomain: ""
    tls:
      email: ""
      enabled: false
      production: false
  key: production
  repository: environment-cic-60-production
gitops: true
ingress:
  domain: 172.29.35.81.nip.io
  externalDNS: false
  namespaceSubDomain: -jx.
  tls:
    email: ""
    enabled: false
    production: false
kaniko: true
repository: nexus
secretStorage: local
storage:
  backup:
    enabled: false
    url: ""
  logs:
    enabled: false
    url: ""
  reports:
    enabled: false
    url: ""
  repository:
    enabled: false
    url: ""
vault: {}
velero:
  schedule: ""
  ttl: ""
versionStream:
  ref: v1.0.562
  url: https://github.com/jenkins-x/jenkins-x-versions.git
webhook: lighthouse
Expected behavior:
All the pods in the jx namespace should be up and running, and Jenkins X should be installed properly.
Actual behavior:
The tekton-pipelines-controller pod is in CrashLoopBackOff state with the error shown below.
Pods with status in the jx namespace:
NAME READY STATUS RESTARTS AGE
jenkins-x-chartmuseum-5687695d57-pp994 1/1 Running 0 1d
jenkins-x-controllerbuild-78b4b56695-mg2vs 1/1 Running 0 1d
jenkins-x-controllerrole-765cf99bdb-swshp 1/1 Running 0 1d
jenkins-x-docker-registry-5bcd587565-rhd7q 1/1 Running 0 1d
jenkins-x-gcactivities-1598421600-jtgm6 0/1 Completed 0 1h
jenkins-x-gcactivities-1598423400-4rd76 0/1 Completed 0 43m
jenkins-x-gcactivities-1598425200-sd7xm 0/1 Completed 0 13m
jenkins-x-gcpods-1598421600-z7s4w 0/1 Completed 0 1h
jenkins-x-gcpods-1598423400-vzb6p 0/1 Completed 0 43m
jenkins-x-gcpods-1598425200-56zdp 0/1 Completed 0 13m
jenkins-x-gcpreviews-1598421600-5k4vf 0/1 Completed 0 1h
jenkins-x-nexus-c7dcb47c7-fh7kx 1/1 Running 0 1d
lighthouse-foghorn-654c868bc8-d5w57 1/1 Running 0 1d
lighthouse-gc-jobs-1598421600-bmsq8 0/1 Completed 0 1h
lighthouse-gc-jobs-1598423400-zskt5 0/1 Completed 0 43m
lighthouse-gc-jobs-1598425200-m9gtd 0/1 Completed 0 13m
lighthouse-jx-controller-6c9b8994bd-qt6tc 1/1 Running 0 1d
lighthouse-keeper-7c6fd9466f-gdjjt 1/1 Running 0 1d
lighthouse-webhooks-56668dc58b-4c52j 1/1 Running 0 1d
lighthouse-webhooks-56668dc58b-8dh27 1/1 Running 0 1d
tekton-pipelines-controller-76c8c8dd78-llj4c 0/1 CrashLoopBackOff 436 1d
tiller-7ddfd45c57-rwtt9 1/1 Running 0 1d
Error log:
2020/08/24 18:38:00 Registering 4 clients
2020/08/24 18:38:00 Registering 3 informer factories
2020/08/24 18:38:00 Registering 8 informers
2020/08/24 18:38:00 Registering 2 controllers
{"level":"info","caller":"logging/config.go:108","msg":"Successfully created the logger."}
{"level":"info","caller":"logging/config.go:109","msg":"Logging level set to info"}
{"level":"fatal","logger":"tekton","caller":"sharedmain/main.go:149","msg":"Version check failed","commit":"821ac4d","error":"kubernetes version \"v1.11.0\" is not compatible, need at least \"v1.14.0\" (this can be overridden with the env var \"KUBERNETES_MIN_VERSION\")","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain.MainWithConfig\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain/main.go:149\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain.MainWithContext\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain/main.go:114\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/controller/main.go:72\nruntime.main\n\truntime/proc.go:203"}
After downgrading the Tekton image from 0.11.0 to 0.9.0, the tekton-pipelines-controller pod reaches the Running state. But then a new tekton-pipelines-webhook pod gets created, and it goes into CrashLoopBackOff.
Jx version:
Version 2.1.127
Commit 4bc05a9
Build date 2020-08-05T20:34:57Z
Go version 1.13.8
Git tree state clean
Diagnostic information:
The output of jx diagnose version is:
Running in namespace: jx
Version 2.1.127
Commit 4bc05a9
Build date 2020-08-05T20:34:57Z
Go version 1.13.8
Git tree state clean
NAME VERSION
Kubernetes cluster v1.11.0+d4cacc0
kubectl (installed in JX_BIN) v1.16.6-beta.0
helm client 2.16.9
git 2.24.1
Operating System "CentOS Linux release 7.8.2003 (Core)"
Please visit https://jenkins-x.io/faq/issues/ for any known issues.
Finished printing diagnostic information
Kubernetes cluster: openshift - 3.11
Kubectl version:
Client Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2018-10-15T09:45:30Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Operating system / Environment:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
I need to install Jenkins X via jx boot on OpenShift 3.11, which uses the default Kubernetes version 1.11.0, but jx boot requires at least 1.14.0. Please suggest if there is any workaround to get Jenkins X on OpenShift 3.11.
As the error message in the crash loop shows, kubernetes version "v1.11.0" is not compatible, need at least "v1.14.0", which makes it not installable on OpenShift 3 (as it ships with Kubernetes 1.11.0). It seems Jenkins X comes with Tekton Pipelines v0.14.2, which requires at least Kubernetes 1.14.0 (and later releases, like Tekton Pipelines v0.15.0, require Kubernetes 1.16.0).
{"level":"fatal","logger":"tekton","caller":"sharedmain/main.go:149","msg":"Version check failed","commit":"821ac4d","error":"kubernetes version \"v1.11.0\" is not compatible, need at least \"v1.14.0\" (this can be overridden with the env var \"KUBERNETES_MIN_VERSION\")","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain.MainWithConfig\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain/main.go:149\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain.MainWithContext\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain/main.go:114\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/controller/main.go:72\nruntime.main\n\truntime/proc.go:203"}
Theoretically, setting KUBERNETES_MIN_VERSION in the controller deployment might make it work, but it is not tested, and the controller most likely won't behave correctly, as it uses features that are not available in 1.11.0. Other than this, there is no workaround that I know of.
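For completeness, a minimal sketch of that override (untested; it assumes the controller Deployment lives in the jx namespace, as in the pod listing above):
kubectl -n jx set env deployment/tekton-pipelines-controller KUBERNETES_MIN_VERSION=v1.11.0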
Using Artillery.io 1.6.0-10, I call an API that returns JSON and try to capture one of the values for later use in the flow; however, the capture doesn't seem to be working. Here is the simplified code:
get_ddg.yml
config:
  target: "https://api.duckduckgo.com"
  phases:
  - duration: 3
    arrivalCount: 1
scenarios:
- name: "Get search"
  flow:
  - get:
      url: "/?q=DuckDuckGo&format=json"
      capture:
        json: "$.Abstract"
        as: "abstract"
  - log: "Abstract: {{ $abstract }}"
When I run artillery, the value is empty:
$ artillery run get_ddg.yml
Started phase 0, duration: 3s # 10:28:34(+0200) 2017-10-25
⠋ Abstract: <----- EMPTY! NO VALUE FOR $abstract
Report # 10:28:37(+0200) 2017-10-25
Scenarios launched: 1
Scenarios completed: 1
Requests completed: 1
Concurrent users: 1
RPS sent: 2.08
Request latency:
min: 311.9
max: 311.9
median: 311.9
p95: NaN
p99: NaN
Scenario duration:
min: 349.5
max: 349.5
median: 349.5
p95: NaN
p99: NaN
Codes:
200: 1
Any help is much appreciated.
Found the solution. The problem is how the variable is read after capture. The correct way to reference the variable is without the '$':
- log: "Abstract: {{ abstract }}"
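For reference, the relevant part of the scenario then becomes:
  flow:
  - get:
      url: "/?q=DuckDuckGo&format=json"
      capture:
        json: "$.Abstract"
        as: "abstract"
  - log: "Abstract: {{ abstract }}"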
I have an inventory file which has an RDS endpoint:
[ems_db]
syd01-devops.ce4l9ofvbl4z.ap-southeast-2.rds.amazonaws.com
I wrote the following playbook to create a CloudWatch alarm:
---
- name: Get instance ec2 facts
  debug: var=groups.ems_db[0].split('.')[0]
  register: ems_db_name

- name: Display
  debug: var=ems_db_name

- name: Create CPU utilization metric alarm
  ec2_metric_alarm:
    state: present
    region: "{{aws_region}}"
    name: "{{ems_db_name}}-cpu-util"
    metric: "CPUUtilization"
    namespace: "AWS/RDS"
    statistic: Average
    comparison: ">="
    unit: "Percent"
    period: 300
    description: "It will be triggered when CPU utilization is more than 80% for 5 minutes"
    dimensions: { 'DBInstanceIdentifier' : ems_db_name }
    alarm_actions: arn:aws:sns:ap-southeast-2:493552970418:cloudwatch_test
    ok_actions: arn:aws:sns:ap-southeast-2:493552970418:cloudwatch_test
But this results in
TASK: [cloudwatch | Get instance ec2 facts] ***********************************
ok: [127.0.0.1] => {
"var": {
"groups.ems_db[0].split('.')[0]": "syd01-devops"
}
}
TASK: [cloudwatch | Display] **************************************************
ok: [127.0.0.1] => {
"var": {
"ems_db_name": {
"invocation": {
"module_args": "var=groups.ems_db[0].split('.')[0]",
"module_complex_args": {},
"module_name": "debug"
},
"var": {
"groups.ems_db[0].split('.')[0]": "syd01-devops"
},
"verbose_always": true
}
}
}
TASK: [cloudwatch | Create CPU utilization metric alarm] **********************
failed: [127.0.0.1] => {"failed": true}
msg: BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
<Error>
<Type>Sender</Type>
<Code>MalformedInput</Code>
</Error>
<RequestId>f30470a3-2d65-11e6-b7cb-cdbbbb30b60b</RequestId>
</ErrorResponse>
FATAL: all hosts have already failed -- aborting
What is wrong here? What can I do to solve this? I am new to this, but it surely seems like a syntax issue with the way I am picking up and splitting the inventory endpoint.
The variable isn't being assigned the way you expect by the first debug task: register stores the whole result of the debug module, not the string itself. You may get closer by changing it to a msg enclosed in quotes and double braces (untested):
- name: Get instance ec2 facts
  debug: msg="{{groups.ems_db[0].split('.')[0]}}"
  register: ems_db_name
However, I would use the set_fact module in that task (instead of debug) and assign the value to it. That way, you can reuse it in this and subsequent tasks of the play.
- name: Get instance ec2 facts
  set_fact: ems_db_name="{{groups.ems_db[0].split('.')[0]}}"
UPDATE: add a threshold: 80.0 to the last task, and the dimensions value needs to reference the variable wrapped in double braces.
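Putting the pieces together, the last task would become something like this (a sketch based on the fixes above; the region, description, and SNS ARNs are taken from the question):
- name: Create CPU utilization metric alarm
  ec2_metric_alarm:
    state: present
    region: "{{ aws_region }}"
    name: "{{ ems_db_name }}-cpu-util"
    metric: "CPUUtilization"
    namespace: "AWS/RDS"
    statistic: Average
    comparison: ">="
    threshold: 80.0
    unit: "Percent"
    period: 300
    description: "It will be triggered when CPU utilization is more than 80% for 5 minutes"
    dimensions: { 'DBInstanceIdentifier': "{{ ems_db_name }}" }
    alarm_actions: arn:aws:sns:ap-southeast-2:493552970418:cloudwatch_test
    ok_actions: arn:aws:sns:ap-southeast-2:493552970418:cloudwatch_test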