We have a plan to scale up the max pods per node from 200 to 350.
Based on the documentation, in order for the new node configuration to take effect, the atomic-openshift-node service needs to be restarted.
The cluster where this node is located serves business-critical DCs, pods, services, routes, etc.
The question is: what is the possible operational impact, if any, during the restart of the atomic-openshift-node service? Or is there no direct impact on the applications at all?
Ref: https://docs.openshift.com/container-platform/3.3/install_config/master_node_configuration.html
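For reference, the change itself is small; per the linked docs it amounts to editing kubeletArguments in /etc/origin/node/node-config.yaml on the node, roughly:

    # /etc/origin/node/node-config.yaml (excerpt) - raise the kubelet pod limit
    kubeletArguments:
      max-pods:
        - "350"

followed by systemctl restart atomic-openshift-node on that node.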
Containers should keep running without interruption. However, the node service can in some cases report a "NotReady" status for some pods. I don't know the exact cause; I suspect a race condition, depending on the timing of the restart and the readiness probe parameters, and maybe on other node performance conditions.
This may result in the service being unavailable for a while if the pod is removed from the "router" backends.
Most likely, as long as only one node is changed at a time and the important applications are scaled with HA rules in mind, there should be no "business impact".
However, in the case of a node ConfigMap change (3.11, a new design introduced long after the original question), many node services may be restarted in parallel (in fact it does not happen immediately, but still within a short period), which I consider a problematic consequence of the node ConfigMap concept (one ConfigMap for all app nodes).
I have a managed instance group with autoscaling enabled: a minimum of 1 and a maximum of 10 instances, with health checks and a CPU target of 0.8.
The number of instances continually switches between 0 and 1, every few minutes. I am unable to find the reason GCP decides to remove an instance and then immediately add it back. The health checks have no logs anywhere.
More concerning is that the minimum number of instances is being violated.
Thoughts? Thanks!
Edit: This may be due to instances becoming unhealthy, most likely because a firewall rule was needed to allow health checks on the instances. The health check worked for load balancing, but apparently not for instance health. I'm using a custom network, so I needed to add the firewall rule.
https://cloud.google.com/compute/docs/load-balancing/health-checks#configure_a_firewall_rule_to_allow_health_checking
Will confirm/update after some monitoring time.
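For anyone hitting the same thing: GCP health check probes come from the documented ranges 130.211.0.0/22 and 35.191.0.0/16, so on a custom network the rule can be created roughly like this (network name and port are placeholders):

    gcloud compute firewall-rules create allow-health-checks \
        --network my-custom-network \
        --allow tcp:80 \
        --source-ranges 130.211.0.0/22,35.191.0.0/16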
Don't confuse two different features: the autohealer and the autoscaler of managed instance groups.
--min-num-replicas is a parameter of the autoscaler; by setting it you ensure that the autoscaler's target number of instances never drops below that threshold. However, autohealing works on its own and does not follow the autoscaler's configuration.
Therefore, when instances in a managed group fail their health checks and autohealing is enabled, they are considered dead and removed from the pool without the minimum number of replicas being taken into account.
It is always best practice to verify that health checks are working properly in order to avoid this kind of misbehaviour. The common issues are:
Firewall rules
Wrong protocols/ports
The server not starting automatically when the machine boots
Note also that if the health checks are more complex and interact with some software on the instance, you need to be sure that that software has started before the checks begin, by configuring the initial delay flag accordingly, i.e. the length of the period during which the instance is known to be initializing and should not be autohealed even if it is unhealthy.
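As a sketch, with a current gcloud the autohealing health check and initial delay can be set on the group like this (group and health check names are placeholders; older gcloud releases exposed this as a beta set-autohealing command instead):

    gcloud compute instance-groups managed update my-mig \
        --health-check my-health-check \
        --initial-delay 300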
The application in question is WordPress; I need to create replicas for rolling deployment / scaling purposes.
It seems I can't create more than one instance of the same container if it uses a persistent volume (backed by a GCE persistent disk):
The Deployment "wordpress" is invalid: spec.template.spec.volumes[0].gcePersistentDisk.readOnly: Invalid value: false: must be true for replicated pods > 1; GCE PD can only be mounted on multiple machines if it is read-only
What are my options? There will be occasional writes and many reads. Ideally the volume would be writable by all containers. I'm hesitant to use network file systems, as I'm not sure whether they'll provide sufficient performance for a web application (where page load time is rather critical).
One idea I have is to create a master container (read and write permission) and slaves (read-only permission). This could work; I'll just need to figure out the Kubernetes configuration required.
In https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistent-volumes you can see a table of the volume types that support ReadWriteMany (the access mode you are looking for).
AzureFile (not suitable if you are using GCP)
CephFS
Glusterfs
Quobyte
NFS
PortworxVolume
The one I've tried is NFS. I had no issues with it, but you should also consider potential performance implications. However, if writes are only occasional, it shouldn't be much of a problem.
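For illustration, a ReadWriteMany setup over NFS looks roughly like the following pair of objects (the server address and export path are placeholders for an NFS server you already run):

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: wordpress-nfs
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteMany
      nfs:
        server: 10.0.0.10         # placeholder NFS server address
        path: /exports/wordpress  # placeholder export path
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: wordpress-nfs
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: ""
      resources:
        requests:
          storage: 10Gi

The WordPress Deployment then mounts the claim as usual, and any number of replicas can share it.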
I think what you are really trying to solve is having a central location for WordPress media files; in that case this would be a better solution: https://wordpress.org/plugins/gcs/
That makes your Kubernetes workload truly stateless, so you can scale it horizontally.
You can use a Regional Persistent Disk. It can be mounted to many nodes (and hence pods) in RW mode. These nodes can be spread across two zones within one region. Regional PDs can be backed by standard or SSD disks. Just note that as of now (September 2018) they are still in beta and may be subject to backward-incompatible changes.
Check the complete spec here:
https://cloud.google.com/compute/docs/disks/#repds
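For completeness, a sketch of a StorageClass that provisions regional PDs through the in-tree GCE PD provisioner (zone names are just examples; pick two zones in your region):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: regional-pd-ssd
    provisioner: kubernetes.io/gce-pd
    parameters:
      type: pd-ssd
      replication-type: regional-pd
      zones: europe-west1-b, europe-west1-c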
I would like to expand/shrink the number of kubelets (nodes) used by a Kubernetes cluster based on resource usage. I have been looking at the code and have some idea of how to implement it at a high level.
I am stuck on two things:
What would be a good way to access the cluster metrics (via Heapster)? Should I use kube-dns to find the Heapster endpoint and query its API directly, or is there some other way? Also, I am not sure how to use kube-dns to get the Heapster URL in the former case.
The rescheduler that expands/shrinks the number of nodes will need to kick in every 30 minutes. What would be the best way to do that? Is there some interface in the code which I can use, or should I write a code segment which gets called every 30 minutes and put it in the main loop?
Any help would be greatly appreciated :)
Part 1:
What you said about using kube-dns to find Heapster and querying its REST API is fine.
You could also write a client interface that abstracts the interface to heapster -- that would help with unit testing.
Take a look at this metrics client:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/metrics/metrics_client.go
It doesn't do exactly what you want: it gets per-Pod stats instead of per-cluster or per-node stats. But you could modify it.
In function getForPods, you can see the code that resolves the heapster service and connects to it here:
resultRaw, err := h.client.Services(h.heapsterNamespace).
ProxyGet(h.heapsterService, metricPath, map[string]string{"start": startTime.Format(time.RFC3339)}).
DoRaw()
where heapsterNamespace is "kube-system" and heapsterService is "heapster".
That metrics client is part of the "horizontal pod autoscaler" implementation. It solves a slightly different problem, but you should take a look at it if you haven't already. It is described here: https://github.com/kubernetes/kubernetes/blob/master/docs/design/horizontal-pod-autoscaler.md
FYI: The Heapster REST API is defined here:
https://github.com/kubernetes/heapster/blob/master/docs/model.md
You should poke around and see if there are node-level or cluster-level CPU metrics that work for you.
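To make that concrete, here is a minimal in-cluster sketch (not part of the original answer) that resolves Heapster through its service's cluster DNS name and pulls node-level CPU usage from the model API linked above; the endpoint paths come from those docs and may need adjusting for your Heapster version:

    # Minimal sketch: query Heapster's model API for node CPU usage rates.
    # Assumes the standard "heapster" Service in the "kube-system" namespace.
    import requests

    HEAPSTER = "http://heapster.kube-system.svc.cluster.local"

    def node_cpu_usage_rate():
        """Return {node_name: latest cpu/usage_rate value} for every node."""
        nodes = requests.get(HEAPSTER + "/api/v1/model/nodes/").json()
        usage = {}
        for node in nodes:
            resp = requests.get(
                HEAPSTER + "/api/v1/model/nodes/%s/metrics/cpu/usage_rate" % node
            ).json()
            points = resp.get("metrics", [])
            if points:
                usage[node] = points[-1]["value"]
        return usage

    if __name__ == "__main__":
        print(node_cpu_usage_rate())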
Part 2:
There is no standard interface for shrinking nodes. It is different for each cloud provider. And if you are on-premises, then you can't shrink nodes.
Related discussion:
https://github.com/kubernetes/kubernetes/issues/11935
Side note: Among Kubernetes developers, we typically use the term "rescheduler" for something that rebalances pods across machines by removing a pod from one machine and creating the same kind of pod on another machine. That is different from what you are talking about building. We haven't built a rescheduler yet, but there is an outline of how to build one here:
https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/rescheduler.md
How can we distribute work via RabbitMQ such that worker pools can subscribe to work messages based on differing (but frequently overlapping) criteria, yet when a routed message matches multiple worker pools, only one worker picks up the job?
Simplified example:
We have host1 and host2.
Host1 handles jobs of classA and classB; host2 handles jobs of classB and classC.
If we route a job of classA, only host1 will pick it up; if we route a job of classB, either host1 or host2 will pick it up (based on their current load / first available) but never both.
It would seem that we need to use a topic exchange, as our routing criteria are complex and using wildcards gives us the kind of flexible matching we want.
However:
If we use the same name for the worker pool queue (say "worker-jobs"), we get the desired splitting of work across arbitrary matching workers, but every worker subscribing to the named queue seems to infect the others with its routing criteria as it binds the queue. I.e. the routing-key bindings seem to apply at the level of the shared queue, not per connection.
If we use a different queue name for each worker pool (say "poolA-jobs" and "poolB-jobs") bound to the same exchange, then we get the desired behavior of each pool keeping its own routing criteria, but a job that matches both poolA and poolB gets routed to both of them (albeit to only one worker in each).
Notes:
I’ve spared you the details of why but suffice to say we have an existing multi-petabyte distributed search application that needs response times < 50ms. We already achieve this with our own custom routing hub but we’d like to replace this with RabbitMQ as its performance is attractive (as is retiring homemade code that overlaps with general purpose community projects) if we can get the sophisticated routing we need.
We use Python
Disco isn’t viable for many reasons, too numerous to go into.
It doesn’t have to be RabbitMQ but the performance needs to be as good. ØMQ looks very interesting and like it might provide both the flexibility and the performance but we’re already using RabbitMQ and after wading through the first half of the colorfully written ØMQ guide I’m still not sure if it will support the routing we need but it does look like we’ll have to pretty much write a broker to do it.
We actually have the luxury of knowing which hosts are capable of serving which jobs, so we can do something like have host1 subscribe to #.host1.# and host2 to #.host2.#. Then when we route a classB message, we can give it a key of host1.host2 to indicate which backends are acceptable for service. This simplifies the routing rules but still doesn’t overcome the problem described.
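To make the second variant concrete, here is a minimal pika sketch of the per-pool-queue setup described above (names are illustrative); it also demonstrates the duplication problem, since a key like host1.host2 matches both bindings:

    # Sketch of the per-host-queue variant (Python / pika 1.x); names are illustrative.
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.exchange_declare(exchange="jobs", exchange_type="topic")

    # Each worker pool gets its own queue, bound to the hosts it can serve.
    for host in ("host1", "host2"):
        queue = "%s-jobs" % host
        ch.queue_declare(queue=queue)
        ch.queue_bind(exchange="jobs", queue=queue, routing_key="#.%s.#" % host)

    # A classB job that either host may serve: the key matches both bindings,
    # so each pool's queue receives its own copy -- exactly the problem described.
    ch.basic_publish(exchange="jobs", routing_key="host1.host2", body="classB job")
    conn.close()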
As we did this in the past, I'd like to gather useful information for everyone moving to load balancing, as there are issues your code must be aware of.
We moved from a single Apache server to Squid as a reverse proxy/load balancer with three Apache servers behind it.
We are using PHP/MySQL, so the issues may differ for you.
Things we had to solve:
Sessions
We moved from "default" PHP sessions (files) to distributed memcached sessions. A simple solution, and it has to be done. This way, you also don't need "sticky sessions" on your load balancer.
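For illustration, with the PECL memcache extension this is little more than a php.ini change (server addresses are placeholders; the newer memcached extension uses slightly different settings):

    ; php.ini - store sessions in memcached instead of local files
    session.save_handler = memcache
    session.save_path = "tcp://memcache1:11211, tcp://memcache2:11211"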
Caching
In addition to the non-distributed APC cache on each web server, we added another memcached layer for distributed object caching, and replaced all old/outdated file-caching systems with it.
Uploads
Uploads go to a shared (NFS) folder.
Things we optimized for speed:
Static Files
Our main NFS server runs lighttpd, serving (among other things, user-uploaded) images. Squid is aware of that and never queries our Apache nodes for images, which gave a nice performance boost. Squid is also configured to cache those files in RAM.
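As a rough sketch of that kind of Squid setup (peer and ACL names are made up, not our actual configuration):

    # squid.conf (excerpt) - keep static files off the apache nodes and cached in RAM
    cache_mem 1024 MB
    acl static_files urlpath_regex -i \.(jpg|jpeg|png|gif|css|js)$
    cache_peer lighttpd-nfs parent 80 0 no-query originserver name=static
    cache_peer apache1      parent 80 0 no-query originserver name=app1
    # (app2 and app3 declared the same way)
    cache_peer_access static allow static_files
    cache_peer_access app1   deny  static_files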
What did you do to get your code/project ready for load balancing, what other concerns are there for people thinking about this move, and which platform/language are you using?
When doing this:
For HTTP nodes, I push hard for a single system image (OCFS2 is good for this) and use either Pound or Crossroads as a load balancer, depending on the scenario. Nodes should have a small local disk for swap and to avoid most (but not all) headaches of CDSLs.
Then I bring Xen into the mix. If you place a small, temporal amount of information on XenBus (i.e. how much virtual memory Linux has actually promised to processes per VM, aka Committed_AS), you can quickly detect a brain-dead load balancer and adjust it. Oracle caught on to this too, and is now working to improve the balloon driver in Linux.
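As a sketch of that idea (not my actual tooling): each guest periodically publishes its Committed_AS figure to XenStore so the privileged domain can watch it, e.g.:

    # Sketch: publish Committed_AS from a guest to XenStore every 10 seconds.
    # Assumes the xenstore-write CLI is available in the guest; the
    # "data/committed_as_kb" key is an arbitrary choice, not a standard path.
    import subprocess
    import time

    def committed_as_kb():
        """Read Committed_AS (in kB) from /proc/meminfo."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("Committed_AS:"):
                    return int(line.split()[1])
        raise RuntimeError("Committed_AS not found in /proc/meminfo")

    while True:
        subprocess.check_call(
            ["xenstore-write", "data/committed_as_kb", str(committed_as_kb())]
        )
        time.sleep(10)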
After that I look at the cost of splitting the database usage for any given app across sqlite3 and whatever db the app wants natively, while realizing that I need to split the db so posix_fadvise() can do its job and not pollute kernel buffers needlessly. Since most DBMS services want to do their own buffering, you must also let them do their own clustering. This really dictates the type of DB cluster that I use and what I do to the balloon driver.
Memcache servers then boot from a skinny initrd, again while the privileged domain watches their memory and CPU use so it knows when to boot more.
The choice of heartbeat/takeover really depends on the given network and the expected usage of the cluster. It's hard to generalize that one.
The end result is typically 5 or 6 physical nodes with quite a bit of memory booting a virtual machine monitor + guests while attached to mirrored storage.
Storage is also hard to describe in general terms... sometimes I use clustered LVM, sometimes not. The "not" will change when LVM2 finally moves away from its current string-based API.
Finally, all of this coordination results in something like Augeas updating configurations on the fly, based on events communicated via Xenbus. That includes ocfs2 itself, or any other service where configurations just can't reside on a single system image.
This is really an application-specific question... can you give an example? I love memcache, but not everyone can benefit from using it, for instance. Are we reviewing your configuration or talking about best practices in general?
Edit:
Sorry for being so Linux-centric... it's typically what I use when designing a cluster.