RabbitMQ showing wrong disk free limit in management console - configuration

As the title says, I have a problem: RabbitMQ shows (and thinks) that there is more disk space available than I gave it.
I'm running 2 instances of RabbitMQ 3.8.8 with Erlang 23.0 in 2 RHEL pods. A dynamically provisioned PersistentVolume of 2 GB on NFS is bound to these pods.
That means every pod should have 1 GB of space for itself.
In rabbitmq.conf I have the following:
vm_memory_high_watermark.relative = 0.9
total_memory_available_override_value = 1000MB
disk_free_limit.absolute = 1GB
management.load_definitions = /etc/rabbitmq/definitions.json
Also, when I start RabbitMQ, I can see in the log that the configuration is read correctly:
2020-10-13 08:26:51.726 [info] <0.427.0> Memory high watermark set to 858 MiB (900000000 bytes)
2020-10-13 08:26:51.811 [info] <0.439.0> Enabling free disk space monitoring
2020-10-13 08:26:51.811 [info] <0.439.0> Disk free limit set to 1000MB
The problem is that RabbitMQ somehow thinks the entire free space of the NFS share (54 GB, as on the screenshot above) is available to it. So I ran into a situation where over 200K messages were stuck in one of the queues and filled up the 2 GB of PersistentVolume I gave it, but the broker didn't stop accepting messages because it thought there was more space available. Of course, the whole RabbitMQ pod then crashed, since it couldn't write any more messages to the NFS.
Can you please guide me on how to set this up correctly?
Or do you know why RabbitMQ doesn't respect the disk_free_limit.absolute value?
Many thanks.

rabbitmq-diagnostics environment | grep disk_free_limit
will display the actual effective configuration value.
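If the setting is being picked up, the grep should match a line along these lines (the exact formatting varies by RabbitMQ version; the value is in bytes, so 1GB appears as 1000000000):

{disk_free_limit,1000000000},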
On Linux, RabbitMQ takes the configured limit (absolute or relative) and compares it against the free space of the partition holding its data directory, which it determines by running
df -kP /path/to/directory
which is not aware of Kubernetes quotas.
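For illustration, here is the kind of thing df reports when the data directory sits on an NFS-backed PV; the mount path, server name and numbers below are made up, but note that "Available" reflects the whole export (~54 GB free here), not the pod's 2 GB quota:

df -kP /var/lib/rabbitmq/mnesia
Filesystem              1024-blocks    Used  Available Capacity Mounted on
nfs-server:/export/pv01    58720256 2097152   56623104       4% /var/lib/rabbitmq/mnesia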
I don't have an NFS partition on Kubernetes to try this on, but a basic test with the following rabbitmq.conf file
disk_free_limit.absolute = 3GB
does not reproduce the issue; the configured value is used as expected.

Regarding your question "why doesn't RabbitMQ respect the disk_free_limit.absolute value": I think it does (even if it reads the free disk space of the k8s pod wrongly).
The value is shown in the image you attached as '954 MiB low watermark'. That means that once you have only 1 GB of free disk space left, the broker will block publishers and only allow consumers to consume until there is more space available on disk.
So as long as the machine has more than 1 GB available, it continues to accept messages.
Perhaps it crashes because it wrongly reads that it has 54 GB of free space, but the disk_free_limit.absolute value itself seems to be read correctly.
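If that is what is happening here, one workaround follows from the behavior described in the other answer: express the limit against what df actually reports, i.e. the free space of the whole export rather than of the PV. A rough sketch, assuming the numbers from the question (about 54 GB free on the export, a 2 GB PV per broker):

# rabbitmq.conf -- hypothetical workaround, not a general fix:
# df sees ~54 GB free on the whole export, so trip the alarm while
# roughly 1 GB of this broker's 2 GB PV is still unused: 54 - 1 = 53 GB.
disk_free_limit.absolute = 53GB

The obvious caveat is that any other writer on the same export moves that number, so it would have to be recomputed whenever the export's free space changes significantly.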

Related

Compute Engine - Automatic scale

I have one Compute Engine VM to host simple apps. My apps are growing and so is the number of users.
My users work basically from 08:00 AM to 07:00 PM; in this period the CPU and memory usage is high and the speed of work is very important.
I'm preparing to expand the memory and processor in the next few days, but I'm looking for a more scalable and cost-effective way.
Is there a way to automatically add resources when I need them and reduce them when they are no longer needed?
Thanks
The cost of running your VMs is directly related to a number of different factors, e.g. the type of network in use (premium vs standard), the machine type, the boot disk image you use (premium vs open-source images) and the region/zone where your workloads are running, among other things.
Your use case seems to fit managed instance groups (MIGs). With MIGs you essentially configure a template for VMs that share the same attributes. When configuring your MIG, you can specify the CPU/memory threshold beyond which the autoscaler kicks in and adds instances; when the reading goes back below that threshold, the MIG scales down to the minimum number of instances you configured.
You can also use requests per second as an autoscaling signal, and I would recommend you explore the docs to learn more about it.
See docs
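For what it's worth, a minimal sketch of that setup with the gcloud CLI; the template name, group name, machine type, zone and thresholds below are placeholders to adapt:

# 1. A template describing the identical VMs
gcloud compute instance-templates create app-template \
    --machine-type=e2-standard-2

# 2. A managed instance group built from that template
gcloud compute instance-groups managed create app-mig \
    --template=app-template --size=1 --zone=us-central1-a

# 3. Autoscale on CPU: add VMs above ~75% utilization, shrink back to 1
gcloud compute instance-groups managed set-autoscaling app-mig \
    --zone=us-central1-a \
    --min-num-replicas=1 --max-num-replicas=5 \
    --target-cpu-utilization=0.75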

Kubernetes on GCE / Prevent pods undergoing an eviction with "The node was low on compute resources."

A painful investigation into aspects that so far are not well highlighted by the documentation (at least from what I've googled).
My cluster's kube-proxy pods got evicted (more experienced users can probably imagine the issues that causes). I searched a lot, but found no clues about how to bring them up again.
Eventually, describing the concerned pod gave a clear reason: "The node was low on compute resources."
Still not that experienced with balancing resources between pods/deployments and "physical" compute, how would one 'prioritize' (or use a similar approach) to make sure specific pods never end up in such a state?
The cluster has been created with fairly low resources in order to get our hands on Kubernetes while keeping costs low and, eventually, witnessing exactly this kind of problem (gcloud container clusters create deemx --machine-type g1-small --enable-autoscaling --min-nodes=1 --max-nodes=5 --disk-size=30). Is using g1-small asking for trouble?
If you are using iptables-based kube-proxy (the current best practice), then kube-proxy being killed should not immediately cause your network connectivity to fail, but new services and updates to endpoints will stop working. Still, your apps should continue to work, but degrade slowly.
If you are using userspace kube-proxy, you might want to upgrade.
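If you are not sure which mode your kube-proxy is running in, one way to check is to look for the --proxy-mode flag in its pod spec (the pod name here is a placeholder; if the flag is absent, the version's default mode applies):

kubectl -n kube-system get pod kube-proxy-abcde -o yaml | grep -- --proxy-mode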
The error message sounds like it was due to memory pressure on the machine.
When there is memory pressure, Kubelet tries to terminate things in order of lowest to highest QoS level.
If your kube-proxy pod is not using Guaranteed resources, then you might want to change that.
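For reference, "Guaranteed" simply means that every container in the pod sets requests equal to limits for both CPU and memory. A minimal manifest fragment (the values are illustrative, not a recommendation):

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 128Mi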
Other things to look at:
If kube-proxy suddenly used a lot more memory, it could be terminated. If you created a huge number of pods, services or endpoints, this could cause it to use more memory.
If you started processes on the machine that are not under Kubernetes control, that could cause the kubelet to make an incorrect decision about what to terminate. Avoid this.
It is possible that on a machine as small as a g1-small, the amount of node resources held back is insufficient, such that too much guaranteed work got put on the machine -- see allocatable vs capacity. This might need tweaking.
Node oom documentation
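To see the allocatable-vs-capacity gap mentioned above on one of your own nodes (the node name is a placeholder):

kubectl describe node <node-name> | grep -A 5 -E '^(Capacity|Allocatable)'
# the gap between the two sections is what the kubelet holds back for
# system daemons and eviction headroom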

Why did the CPU load drop in the last days?

Does anybody have a hint? I didn't change anything on the machine (except for security updates), and the sites hosted there didn't see a significant change in connections.
Maybe Google changed something in their infrastructure? Coincidentally, there was an issue with the Cloud DNS ManagedZone these days: they charged me $920 for half a month of usage, and it was an error (they counted thousands of weeks of usage too), so they recently changed it back to $0.28. Maybe there was some process that used Cloud DNS by mistake and thus consumed CPU power, and they have corrected it now?
I wish to hear what is happening from someone who knows what's going on in GC. Thank you.
CPU utilization reporting is now more accurate from a VM guest perspective as it doesn't include virtualization layer overhead anymore. It has nothing to do with Cloud DNS.
See this issue for some extra context:
https://code.google.com/p/google-compute-engine/issues/detail?id=281

gear size of more than one CPU / core in OpenShift Enterprise 2?

I'm setting up OpenShift Enterprise 2 and I'd like to create a district with a larger gear size. Changing
/etc/openshift/resource_limits.conf
on the nodes is straightforward for increasing memory and disk available to the gear, but CPU resource management is less intuitive (from resource_limits.conf):
# cpu cpu_rt_period_us=100000 cpu_rt_runtime_us=950000
cpu_shares=128
cpu_cfs_quota_us=100000
By default, a gear can only consume a maximum of 100% of a single processor core. If I want to allow a bigger gear size that could allow full utilization of 2 processor cores, how would I do that, or is it currently not possible at all in OpenShift?
Since all the gears are the same, and since 'cpu_shares' are compared on a relative basis when restricting a group, I'm not sure it makes sense to change 'cpu_shares'.
However, 'cpu_cfs_quota_us' looks like it might be the right knob to turn. From this page:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
It appears that I should be able to double the quota to get a full 2 cores. However, it's not clear whether OpenShift will respect this, since the 'cpu_cfs_period_us' parameter is not even found in resource_limits.conf.
I performed an experiment using 'stress'. I first confirmed that I could load 2 cores from a normal ssh login (using 'stress --cpu 2'). Then I logged in to a gear on that host and ran the same thing. With cpu_cfs_quota_us=100000 I could only consume a maximum of 50% CPU for each stress process, but after changing to cpu_cfs_quota_us=200000 I could consume over 99% for each process, so it appears to work. It would be nice if this were called out in the OpenShift docs...
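For reference, the change that worked in the experiment above, plus a way to confirm what the kernel actually applied at the cgroup level (the cgroup path below is illustrative; it depends on how the node mounts cgroups, and the gear UUID is a placeholder):

# /etc/openshift/resource_limits.conf
# quota / period = 200000 / 100000 = 2 full cores per gear
cpu_cfs_quota_us=200000

# confirm the value the kernel applied for a given gear's cgroup
cat /cgroup/cpu/openshift/<gear-uuid>/cpu.cfs_quota_us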

swap space used while physical memory is free

I recently migrated between 2 servers (the newer one has lower specs), and it freezes all the time even though there is no load on the server. Below are my specs:
HP DL120G5 / Intel Quad-Core Xeon X3210 / 8GB RAM
free -m output:
             total       used       free     shared    buffers     cached
Mem:          7863       7603        260          0        176       5736
-/+ buffers/cache:       1690       6173
Swap:         4094        412       3681
As you can see, there is 412 MB used in swap while almost 80% of the physical RAM is available.
I don't know if this should cause any trouble, but almost no swap was used on my old server, so I'm thinking this does not seem right.
I have a cPanel license, so I contacted their support and they noted that I have high iowait. Indeed, when I ran sar I noticed it sometimes exceeds 60%; most often it's around 20%, but sometimes it reaches 60% or even 70%.
I don't really know how to diagnose that. I suspected my drive was slow and might be causing the latency, so I ran a test using dd and the speed was 250 MB/s, so I think the transfer speed is OK; plus, the hardware is supposed to be brand new.
The high load usually happens when I use gzip or tar to extract files (backing up or restoring a cPanel account).
One important thing to mention is that top reports mysql using 100% to 125% of the CPU, and sometimes much more. If I trace the mysql process, I keep getting this error continually:
setsockopt(376, SOL_IP, IP_TOS, [8], 4) = -1 EOPNOTSUPP (Operation not supported)
I don't know what that means, nor did I find useful information googling it.
I forgot to mention that it's a web hosting server, for what it's worth, so it has the standard web hosting setup (Apache, PHP, MySQL, etc.).
So how do I properly diagnose this issue and find a solution, or what might the possible causes be?
As you may have realized by now, the free -m output shows 7603 MiB (~7.4 GiB) used, not free.
You're out of memory and the system has started swapping, which will drastically slow things down. Since most applications are unaware that their virtual memory is now backed by much slower disk, the system may very well appear to "hang" with no feedback describing the problem.
From your description, the first process I'd kill in order to regain control would be MySQL. If you have ssh/rsh/telnet connectivity to this box from another machine, you may have to log in from there to get a usable command line to kill it from.
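Before killing anything, it is worth confirming what is actually holding memory and swap; a quick sketch (VmSwap in /proc requires a reasonably recent kernel):

# largest resident-memory consumers first
ps aux --sort=-rss | head

# per-process swap usage, largest last
grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -k2 -n | tail

# watch swap-in/swap-out (si/so) and iowait (wa) live
vmstat 5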
My first thought (hypothesis?) about what's happening is...
MySQL is trying to do something that is not supported as this machine is currently configured. It could be a missing library, an unset environment variable, or any number of things.
That operation allocates some memory, but it fails and does not clean up the allocation when it does. If this were a shell script, it could be fixed by putting a trap command at the beginning that runs a function to release memory and clean up.
The code is written to keep retrying on failure, so it rapidly uses up all your memory. Referring back to the shell script illustration, the trap function might also prompt to ask whether you really want to keep retrying.
Not a complete answer, but hopefully it will help.