In my Prometheus configuration I have a job with these settings:
- job_name: name_of_my_job
  scrape_interval: 5m
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: http
The script that creates the metrics takes 3 minutes to finish, but I don't see the metrics in Prometheus. How does the scrape_timeout setting work?
Every 5 minutes (scrape_interval) Prometheus requests the metrics from the given URL. It then waits up to 30 seconds (scrape_timeout) for the response; if the target has not delivered the metrics within that time, the scrape times out and is discarded. Since your endpoint needs about 3 minutes to produce the metrics, a 30-second timeout is too short, which is why you never see them.
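A minimal sketch of how the job could be adjusted (assuming the endpoint really does need about 3 minutes to respond; note that scrape_timeout must not exceed scrape_interval):
- job_name: name_of_my_job
  scrape_interval: 5m
  scrape_timeout: 4m      # long enough for the slow /metrics endpoint
  metrics_path: /metrics
  scheme: http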
I have a Spring Boot application running on AWS Elastic Beanstalk. There are multiple application instances running. The number of running applications might dynamically increase and decrease from 2 to 10 instances.
I have set up Prometheus, but scraping the metrics has a critical limitation: it is only able to scrape the Elastic Beanstalk load balancer. This means that every scrape will return a different instance (round robin), so the metrics fluctuate wildly.
# prometheus.yml
scrape_configs:
  - job_name: "my-backend"
    metrics_path: "/metrics/prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: [ "dev.my-app.website.com" ] # this is a load balancer
        labels:
          application: "my-backend"
          env: "dev"
(I am pursuing a correct setup, where Prometheus can scrape the instances directly, but because of business limitations this is not possible - so I would like a workaround.)
As a workaround I have added a random UUID label to each application instance using RandomValuePropertySource
# application.yml
management:
  endpoints:
    enabled-by-default: false
    web:
      exposure:
        include: "*"
  endpoint:
    prometheus.enabled: true
  metrics:
    tags:
      instance_id: "${random.uuid}" # used to differentiate instances behind load-balancer
This means that the metrics can be uniquely identified, so on one refresh I might get
process_uptime_seconds{instance_id="6fb3de0f-7fef-4de2-aca9-46bc80a6ed27",} 81344.727
While on the next I could get
process_uptime_seconds{instance_id="2ef5faad-6e9e-4fc0-b766-c24103e111a9",} 81231.112
Generally this is fine, and helps for most metrics, but it is clear that Prometheus gets confused and doesn't store the two results separately. This is a particular problem for 'counters', as they are supposed to only increase, but because the different instances handle different requests, the counter might increase or decrease. Graphs end up jagged and disconnected.
I've tried relabelling the instance label (I think that's how Prometheus decides how to store the data separately?), but this doesn't seem to have any effect:
# prometheus.yml
scrape_configs:
  - job_name: "my-backend"
    metrics_path: "/metrics/prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: [ "dev.my-app.website.com" ] # this is a load balancer
        labels:
          application: "my-backend"
          env: "dev"
    metric_relabel_configs:
      - target_label: instance
        source_labels: [__address__, instance_id]
        separator: ";"
        action: replace
To reiterate: I know this is not ideal, and the correct solution is to connect directly - that is in motion and will happen eventually. For now I'm just trying to improve my workaround, so I can get something working sooner.
I use deep learning models written in pytorch_lightning (PyTorch) and train them on Slurm clusters. I submit jobs like this:
sbatch --gpus=1 -t 100 python train.py
When the requested GPU time ends, Slurm kills my program and shows a message like this:
Epoch 0: : 339it [01:10, 4.84it/s, loss=-34] slurmstepd: error: *** JOB 375083 ON cn-007 CANCELLED AT 2021-10-04T22:20:54 DUE TO TIME LIMIT ***
How can I configure the Trainer to save the model when the available time ends?
I know about automatic saving after each epoch, but I have only one long epoch that lasts >10 hours, so that doesn't work for me.
You can use Slurm's signalling mechanism to pass a signal to your application when it's within a certain number of seconds of the time limit (see man sbatch). In your submission script use --signal=USR1@30 to send USR1 30 seconds before the time limit is reached. Your submit script would contain these lines:
#SBATCH -t 100
#SBATCH --signal=USR1@30
srun python train.py
Then, in your code, you can handle that signal like this:
import signal

def handler(signum, frame):
    print('Signal handler got signal ', signum)
    # e.g. exit(0), or call your pytorch save routines

# enable the handler
signal.signal(signal.SIGUSR1, handler)

# your code here
You need to call your Python application via srun in order for Slurm to be able to propagate the signal to the Python process. (You can probably use --signal on the command line to sbatch, I tend to prefer writing self-contained submit scripts :))
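As a rough sketch (not part of the original answer) of what the handler could do in a pytorch_lightning script - the trainer object and checkpoint path here are assumptions to adapt to your own code:
import signal
import pytorch_lightning as pl

def install_timeout_checkpoint(trainer, ckpt_path="timeout.ckpt"):
    def handler(signum, frame):
        print(f"Got signal {signum}, saving checkpoint to {ckpt_path}")
        trainer.save_checkpoint(ckpt_path)  # standard Lightning Trainer API
    signal.signal(signal.SIGUSR1, handler)

# usage sketch:
# trainer = pl.Trainer(gpus=1)
# install_timeout_checkpoint(trainer)
# trainer.fit(model)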
Edit: This link has a nice summary of the issues involved with signal propagation and Slurm.
I am getting the error message below while running automated tests in parallel with TestRail integration:
TestRail API returned HTTP 429("API Rate Limit Exceeded - 180 per minute maximum allowed. Retry after 1 seconds.")
Realistically, it is very difficult to get TestRail to log all execution details when we choose parallel execution.
What you could try as an alternative is to aggregate some of the results and post them in batches.
http://docs.gurock.com/testrail-api2/reference-results#add_results_for_cases
The add_results_for_cases endpoint allows you to attach multiple results to a test run at the same time, so you could collect the results in the teardown fixture of a class or of a full test suite and post them all at once.
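A rough sketch of that batching idea (the URL, credentials, run ID and case IDs are placeholders, and 1/5 are TestRail's default passed/failed status IDs):
import requests

TESTRAIL_URL = "https://example.testrail.io"   # placeholder
RUN_ID = 123                                   # placeholder
AUTH = ("user@example.com", "api_key")         # placeholder

collected_results = []  # filled in as each test finishes

def record_result(case_id, status_id, comment=""):
    collected_results.append({"case_id": case_id, "status_id": status_id, "comment": comment})

def post_results_batch():
    # one API call for the whole batch instead of one call per test,
    # which keeps you well below the 180-requests-per-minute limit
    resp = requests.post(
        f"{TESTRAIL_URL}/index.php?/api/v2/add_results_for_cases/{RUN_ID}",
        json={"results": collected_results},
        auth=AUTH,
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()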
Would that help?
I have an Ingress with these annotations:
annotations:
  kubernetes.io/ingress.class: "nginx"
  nginx.ingress.kubernetes.io/limit-connection: "1"
  nginx.ingress.kubernetes.io/limit-rpm: "20"
and this is the container image version I am using:
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.22.0
I am trying to send 200 requests within a ten-minute window (roughly 20 requests per minute from a single IP address), and beyond that the ingress should refuse the requests.
Which nginx ingress version are you using? Please try quay.io/aledbf/nginx-ingress-controller:0.415 and then check again. Also, please look at this link: https://github.com/kubernetes/ingress-nginx/issues/1839
Try changing limit-connection: to limit-connections:
For more info check this
If that doesn't help, please post your commands or describe how you are testing your connection limits.
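For reference, the corrected annotations would look like this (a sketch with the same values as above):
annotations:
  kubernetes.io/ingress.class: "nginx"
  nginx.ingress.kubernetes.io/limit-connections: "1"
  nginx.ingress.kubernetes.io/limit-rpm: "20"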
I changed it to limit-connections. I set the annotations in the ingress YAML file and applied it, and I can see the following in the nginx conf:
worker_rlimit_nofile 15360;
limit_req_status 503;
limit_conn_status 503;
# Ratelimit test_nginx
# Ratelimit test_nginx

map $whitelist_xxxxxxxxxxxx $limit_xxxxxxxxxx {
limit_req_zone $limit_xxxxxxxx zone=test_nginx_rpm:5m rate=20r/m;
limit_req zone=test_nginx_rpm burst=100 nodelay;
limit_req zone=test_nginx_rpm burst=100 nodelay;
limit_req zone=test_nginx_rpm burst=100 nodelay;
When I keep these annotations:
nginx.ingress.kubernetes.io/limit-connections: "1"
nginx.ingress.kubernetes.io/limit-rpm: "20"
I can see the burst and the other settings above in the nginx conf file. Can you please tell me whether these make any difference?
There are two things that could be making you experience rate-limits higher than configured: burst and nginx replicas.
Burst
As you have already noted in https://stackoverflow.com/a/54426317/3477266, nginx-ingress adds a burst configuration to the final config it creates for the rate-limiting.
The burst value is always 5x your rate-limit value (it doesn't matter whether it's a limit-rpm or a limit-rps setting).
That's why you got a burst=100 from a limit-rpm=20.
You can read here the effect this burst have in Nginx behavior: https://www.nginx.com/blog/rate-limiting-nginx/#bursts
But basically it's possible that Nginx will not return 429 for all the requests you would expect, because of the burst.
The total number of requests routed in a given period will be total = rate_limit * period + burst
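For example, with limit-rpm=20 (so burst=100) over the 10-minute test described above, up to roughly 20 * 10 + 100 = 300 requests could be accepted before Nginx starts rejecting them, which is more than the 200 requests being sent.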
Nginx replicas
Usually nginx-ingress is deployed with the Horizontal Pod Autoscaler enabled, to scale based on demand, or it is explicitly configured to run with more than 1 replica.
In any case, if you have more than 1 replica of Nginx running, each one will handle rate-limiting individually.
This basically means that your rate-limit configuration will be multiplied by the number of replicas, and you could end up with rate-limits a lot higher than you expected.
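For example, with limit-rpm: "20" and 3 replicas of the controller, a single client could be allowed up to roughly 60 requests per minute (plus bursts), because each replica keeps its own counter.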
There is a way to use a memcached instance to make them share the rate-limiting count, as described in: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#global-rate-limiting
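A sketch of what the global rate limiting setup looks like per the linked docs (names and values should be checked against your controller version; this feature needs a much newer controller than 0.22.0 and a reachable memcached instance):
metadata:
  annotations:
    nginx.ingress.kubernetes.io/global-rate-limit: "20"
    nginx.ingress.kubernetes.io/global-rate-limit-window: "1m"
# and in the ingress-nginx ConfigMap:
#   global-rate-limit-memcached-host: "memcached.default.svc.cluster.local"
#   global-rate-limit-memcached-port: "11211"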
I'm always pinging my ISP gateway whenever I'm on my PC. My ISP likes to throttle my connection, so I have to keep a close eye on it. Is there any way to write the ping statistics, and just the statistics, to a file every hour on the hour?
EDIT: I run ping -D 10.0.0.1 in a terminal, and I would like to save only the summary/statistics line that is printed whenever you press CTRL+\ (SIGQUIT) to a file every hour. So the file would look like this:
[1446131810] 100/100 packets, 0% loss, min/avg/ewma/max = 1.818/3.493/3.918/4.254 ms
[1446191810] 200/200 packets, 0% loss, min/avg/ewma/max = 1.818/3.493/3.918/4.254 ms
[1446251810] 300/300 packets, 0% loss, min/avg/ewma/max = 1.818/3.493/3.918/4.254 ms
Maybe you could make a script with an endless loop that sends one ping each iteration and writes to a log file if there is a failure.
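A rough sketch of that idea (the gateway address, log path and timing are placeholders to adapt):
#!/bin/bash
GATEWAY=10.0.0.1
LOGFILE=$HOME/ping_failures.log

while true; do
    # -c 1: send a single ping; -W 2: wait at most 2 seconds for a reply
    if ! ping -c 1 -W 2 "$GATEWAY" > /dev/null 2>&1; then
        echo "$(date '+%F %T') ping to $GATEWAY failed" >> "$LOGFILE"
    fi
    sleep 1
done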
Take a look at this forum thread: http://www.computerhope.com/forum/index.php?topic=120285.0