I would like to expand/shrink the number of kubelets being used by a Kubernetes cluster based on resource usage. I have been looking at the code and have some idea of how to implement it at a high level.
I am stuck on 2 things:
What would be a good way to access the cluster metrics (via Heapster)? Should I use kube-dns to find the Heapster endpoint and query its API directly, or is there some other way? Also, I am not sure how to use kube-dns to get the Heapster URL in the former case.
The rescheduler which expands/shrinks the number of nodes will need to kick in every 30 minutes. What would be the best way to do that? Is there an existing interface in the code that I can use, or should I write a code segment that gets called every 30 minutes and put it in the main loop?
Any help would be greatly appreciated :)
Part 1:
What you said about using kube-dns to find Heapster and querying its REST API is fine.
You could also write a client interface that abstracts access to Heapster -- that would help with unit testing.
Take a look at this metrics client:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/metrics/metrics_client.go
It doesn't do exactly what you want: it gets per-Pod stats instead of per-cluster or per-node stats. But you could modify it.
In function getForPods, you can see the code that resolves the heapster service and connects to it here:
resultRaw, err := h.client.Services(h.heapsterNamespace).
ProxyGet(h.heapsterService, metricPath, map[string]string{"start": startTime.Format(time.RFC3339)}).
DoRaw()
where heapsterNamespace is "kube-system" and heapsterService is "heapster".
That metrics client is part of the "horizontal pod autoscaler" implementation. It is solving a slightly different problem, but you should take a look at it if you haven't already. It is described here: https://github.com/kubernetes/kubernetes/blob/master/docs/design/horizontal-pod-autoscaler.md
FYI: The Heapster REST API is defined here:
https://github.com/kubernetes/heapster/blob/master/docs/model.md
You should poke around and see if there are node-level or cluster-level CPU metrics that work for you.
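To make the kube-dns route concrete, here is a rough sketch of the kind of client interface mentioned above, with an implementation that resolves Heapster through its in-cluster DNS name and queries the model API over plain HTTP. The interface, the service port, and the exact metric path are my assumptions rather than something from the existing code -- check the model docs above for the metric names that are actually exposed.

package heapsterclient

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

// MetricsClient abstracts Heapster access so the autoscaling logic can be
// unit tested against a fake implementation.
type MetricsClient interface {
	NodeMetric(nodeName, metricName string) ([]byte, error)
}

// heapsterClient resolves Heapster via kube-dns. It assumes the code runs
// inside the cluster, the default cluster.local domain, and that the
// heapster service in kube-system exposes port 80.
type heapsterClient struct {
	baseURL string
	http    *http.Client
}

func New() MetricsClient {
	return &heapsterClient{
		baseURL: "http://heapster.kube-system.svc.cluster.local",
		http:    &http.Client{Timeout: 10 * time.Second},
	}
}

// NodeMetric fetches one metric for one node from the Heapster model API,
// e.g. metricName "cpu-usage" (check the model docs for the exact names).
func (c *heapsterClient) NodeMetric(nodeName, metricName string) ([]byte, error) {
	url := fmt.Sprintf("%s/api/v1/model/nodes/%s/metrics/%s", c.baseURL, nodeName, metricName)
	resp, err := c.http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("heapster returned %s for %s", resp.Status, url)
	}
	return ioutil.ReadAll(resp.Body)
}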
Part 2:
There is no standard interface for shrinking nodes. It is different for each cloud provider. And if you are on-premises, then you can't shrink nodes.
Related discussion:
https://github.com/kubernetes/kubernetes/issues/11935
Side note: Among Kubernetes developers, we typically use the term "rescheduler" when talking about something that rebalances pods across machines, by removing a pod from one machine and creating the same kind of pod on another machine. That is a different thing than what you are talking about building. We haven't built a rescheduler yet, but there is an outline of how to build one here:
https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/rescheduler.md
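On the second half of your original question (kicking in every 30 minutes): there is no special scheduling framework for that; the existing controllers just run a loop. A minimal sketch of the pattern with only the standard library, where checkAndResize is a placeholder for your own logic:

package main

import (
	"log"
	"time"
)

// checkAndResize is a placeholder for the logic that queries the metrics
// and decides whether to grow or shrink the node pool.
func checkAndResize() {
	log.Println("checking cluster utilization...")
}

func main() {
	ticker := time.NewTicker(30 * time.Minute)
	defer ticker.Stop()

	checkAndResize() // run once at startup, then on every tick
	for range ticker.C {
		checkAndResize()
	}
}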
Related
Could you help me with one thing?
I have a project with a few deployments - databases, app1, app2, monitoring, etc.
I have a very specific system that needs to be scaled based on multiple metrics that are stored in the monitoring system.
I have created a small microservice which checks whether the conditions are met (it's not something like a simple GET - there is a whole algorithm that calculates it) and scales the environment (oc app or curl - it doesn't matter at this point).
Here's my question - is it a good solution? I wonder if it can be done in a better way.
And the second one - is it okay to create a new serviceaccount with the edit role just to perform scaling?
I know it's not complicated when you have some experience in OpenShift, but this system is so specific (autoscaling managed by algorithms) that I couldn't find anything useful in the docs.
Thank you.
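Not an answer to the design question, but on the mechanics: instead of shelling out to oc or curl, the microservice can change the replica count through the scale subresource of the deployment, which is also the only permission the serviceaccount strictly needs (narrower than the full edit role). A rough client-go sketch, assuming the service runs in-cluster; the namespace and deployment name are placeholders:

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// scaleDeployment sets the replica count of a deployment via the scale
// subresource, using the pod's serviceaccount token for authentication.
func scaleDeployment(namespace, name string, replicas int32) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	scale, err := clientset.AppsV1().Deployments(namespace).GetScale(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = replicas
	_, err = clientset.AppsV1().Deployments(namespace).UpdateScale(context.TODO(), name, scale, metav1.UpdateOptions{})
	return err
}

func main() {
	// Placeholder namespace and deployment name.
	if err := scaleDeployment("my-project", "app1", 3); err != nil {
		log.Fatal(err)
	}
}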
The application in question is WordPress; I need to create replicas for rolling deployment / scaling purposes.
It seems I can't create more than 1 instance of the same container if it uses a persistent disk (GCP term):
The Deployment "wordpress" is invalid: spec.template.spec.volumes[0].gcePersistentDisk.readOnly: Invalid value: false: must be true for replicated pods > 1; GCE PD can only be mounted on multiple machines if it is read-only
What are my options? There will be occasional writes and many reads. Ideally writable by all containers. I'm hesitant to use the network file systems as I'm not sure whether they'll provide sufficient performance for a web application (where page load is rather critical).
One idea I have is to create a master container (read and write permission) and slaves (read-only permission); this could work - I'll just need to figure out the Kubernetes configuration required.
In https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistent-volumes you can see a table with the possible storage classes that allow ReadWriteMany (the option you are looking for).
AzureFile (not suitable if you are using GCP)
CephFS
Glusterfs
Quobyte
NFS
PortworxVolume
The one that I've tried is NFS. I had no issues with it, but I guess you should also consider potential performance issues. However, if the writes are only occasional, it shouldn't be much of an issue.
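Whichever backend you pick from that list, the ReadWriteMany part itself is just the access mode you request on the PersistentVolumeClaim. A rough sketch using the Go core/v1 types (claim name, size, and storage class name are placeholders, and the Resources field is typed as in the client-go releases current around the time of this thread):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Placeholder: whatever RWX-capable storage class you provision (NFS, etc.).
	storageClassName := "nfs"

	// A claim that asks for ReadWriteMany, so several WordPress pods can
	// mount the same volume for uploads.
	claim := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "wordpress-uploads"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteMany},
			StorageClassName: &storageClassName,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Gi"),
				},
			},
		},
	}
	fmt.Printf("claim %s access modes: %v\n", claim.Name, claim.Spec.AccessModes)
}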
I think what you are trying to solve is having a central location for WordPress media files; in that case this would be a better solution: https://wordpress.org/plugins/gcs/
That makes your Kubernetes workload truly stateless, and you can scale horizontally.
You can use a Regional Persistent Disk. It can be mounted to many nodes (and hence pods) in RW mode. These nodes can be spread across two zones within one region. Regional PDs can be backed by standard or SSD disks. Just note that as of now (September 2018) they are still in beta and may be subject to backward-incompatible changes.
Check the complete spec here:
https://cloud.google.com/compute/docs/disks/#repds
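If you go this route on GKE, the usual way to get regional PDs from Kubernetes is a StorageClass on the gce-pd provisioner with the replication-type parameter set to regional-pd. A rough sketch with the Go storage/v1 types (the class name and disk type are placeholders):

package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// StorageClass that asks the in-tree GCE PD provisioner for an SSD-backed
	// regional disk, replicated across two zones in the region.
	sc := &storagev1.StorageClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "regionalpd-ssd"},
		Provisioner: "kubernetes.io/gce-pd",
		Parameters: map[string]string{
			"type":             "pd-ssd",
			"replication-type": "regional-pd",
		},
	}
	fmt.Println(sc.Name, sc.Parameters)
}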
I have been trying out kubernetes on GCP to build microservices for a while and it has been amazing so far.
I am a bit confused, though, on what would probably be the best approach. Should I:
create (gcloud container clusters create "[cluster-name]" ...) one container-cluster per service?
create one container-cluster for multiple services? or
do both of those above depending on my situation?
All of the examples I have managed to find have only covered #2, but my hunch is kind of telling me that I should do #1. My hunch is also kind of telling me that I have probably missed some basic understanding around containers. I have been trying to find answers without any luck; I guess I just can't figure out the right search keyword, so I am hoping that I could find some answer here.
I suspect the answer is "it depends" (Option 3)
How isolated do you need each application to be? How much redundancy (tolerance of VM failure) do you need? How many developers will have access to the Kubernetes cluster? How much traffic do the apps generate?
If in doubt, I recommend running all your apps on a single cluster. It's simpler to manage, and the overhead of providing highly available infrastructure is then shared. You'll also have greater levels of VM utilization, which might result in reduced hosting costs. The latter is a subjective observation; some apps have very little or only occasional traffic, resulting in very bored servers :-)
This is the first time I have asked a question on here.
I have an expanding set of services hosted on Google Cloud Platform.
The initial round was set up in a very stressed situation, and I am now refactoring.
I currently have 3 (EDIT: no, that's 4) microservice VM hosts, which will all be HTTPS soon (and so need their own IPs). In addition there is a list of test boxes, as we are developing bits. Test boxes do not need HTTPS.
Question 1) Does anyone have a workaround to get multiple static IPs per host? This is why I have large numbers of hosts.
Question 2) How can I have more than a /29 of static IPs (e.g. 8 or more)? This is corporate work; we will pay for services.
Question 3) According to the Google API, I may deallocate static IPs. I cannot find an implementation for this. Do you know of one? As I have built systems like this in the past, I know there is no technical reason why there should not be an API for this.
Bonus question 4) Is there a mechanism to serialise a saved hard disk out of Google Cloud? This would make my CEO happy.
An ideal response is a relevant "Fine Manual" to read.
I work on GMT time. All Linux hosts, probably not relevant. Although a developer, I can admin most things Linux.
UPDATE: if you delete an IP via gcloud compute addresses delete $name --region europe-west1 but don't remove it from the interface inside the box, the address is no longer static, which is the objective of Q3.
You can find the answers to your questions below:
It's not directly possible to assign multiple IPs to an instance. One workaround to achieve this is to create multiple forwarding rules pointing to the same target pool containing that instance.
It's currently not possible to reserve a whole block of IP addresses, as the addresses are randomly assigned to instances from the pool of available IPs.
If you have reserved static IPs in your project, you can release an IP from one instance and assign it to another.
There is no direct way to do that; however, one workaround I can think of is to use the dd tool to clone your disk as .raw and save that to Cloud Storage. This clone can be used to create other disks outside your project.
I hope that helps.
As we did this in the past, I'd like to gather useful information for everyone moving to load balancing, as there are issues your code must be aware of.
We moved from one Apache server to Squid as a reverse proxy/load balancer with three Apache servers behind it.
We are using PHP/MySQL, so issues may differ.
Things we had to solve:
Sessions
We moved from "default" PHP sessions (files) to distributed memcached sessions. Simple solution, has to be done. This way, you also don't need "sticky sessions" on your load balancer.
Caching
To the non-distributed APC cache on each web server, we added another memcached layer for distributed object caching, and replaced all old/outdated file-caching systems with it.
Uploads
Uploads go to a shared (NFS) folder.
Things we optimized for speed:
Static Files
Our main NFS server runs lighttpd, serving (also user-uploaded) images. Squid is aware of that and never queries our Apache nodes for images, which gave a nice performance boost. Squid is also configured to cache those files in RAM.
What did you do to get your code/project ready for loadbalancing, any other concerns for people thinking about this move, and which platform/language are you using?
When doing this:
For HTTP nodes, I push hard for a single system image (OCFS2 is good for this) and use either Pound or Crossroads as a load balancer, depending on the scenario. Nodes should have a small local disk for swap and to avoid most (but not all) headaches of CDSLs.
Then I bring Xen into the mix. If you place a small, temporal amount of information on Xenbus (i.e. how much virtual memory Linux has actually promised to processes per VM, aka Committed_AS), you can quickly detect a brain-dead load balancer and adjust it. Oracle caught on to this too, and is now working to improve the balloon driver in Linux.
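For reference, the Committed_AS figure mentioned above is simply what the kernel reports in /proc/meminfo; a small sketch of reading it (publishing the value to Xenbus would then go through whatever xenstore tooling you use, which I am leaving out):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// committedAS returns the Committed_AS value from /proc/meminfo in kB,
// i.e. how much virtual memory the kernel has promised to processes.
func committedAS() (int64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Committed_AS:") {
			var kb int64
			if _, err := fmt.Sscanf(line, "Committed_AS: %d kB", &kb); err != nil {
				return 0, err
			}
			return kb, nil
		}
	}
	return 0, fmt.Errorf("Committed_AS not found in /proc/meminfo")
}

func main() {
	kb, err := committedAS()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Committed_AS: %d kB\n", kb)
}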
After that I look at the cost of splitting the database usage for any given app across sqlite3 and whatever db the app wants natively, while realizing that I need to split the db so posix_fadvise() can do its job and not pollute kernel buffers needlessly. Since most DBMS services want to do their own buffering, you must also let them do their own clustering. This really dictates the type of DB cluster that I use and what I do to the balloon driver.
Memcache servers then boot from a skinny initrd, again while the privileged domain watches their memory and CPU use so it knows when to boot more.
The choice of heartbeat / takeover really depends on the given network and the expected usage of the cluster. It's hard to generalize that one.
The end result is typically 5 or 6 physical nodes with quite a bit of memory booting a virtual machine monitor + guests while attached to mirrored storage.
Storage is also hard to describe in general terms... sometimes I use clustered LVM, sometimes not. The "not" will change when LVM2 finally moves away from its current string-based API.
Finally, all of this coordination results in something like Augeas updating configurations on the fly, based on events communicated via Xenbus. That includes ocfs2 itself, or any other service where configurations just can't reside on a single system image.
This is really an application-specific question... can you give an example? I love memcache, but not everyone can benefit from using it, for instance. Are we reviewing your configuration or talking about best practices in general?
Edit:
Sorry for being so Linux-centric... it's typically what I use when designing a cluster.