I am using Google Kubernetes Engine to manage a cluster with several node pools. Each pool has a different configuration (e.g. not all have auto-scaling enabled).
The pools are mostly unused during the night, and so I would like to reduce resource consumption and cost during this period (about 10 hours).
I've considered stopping the VM instances at the end of the day and restarting them in the morning. Additionally, I could temporarily scale down the number of nodes by running gcloud container clusters resize $CLUSTER_NAME --size=0
What would be the best option to reduce costs during unused periods? Is there a better way?
The cluster autoscaler (which adjusts the number of nodes in your node pools) will not be able to scale all of your node pools to zero, because some system pods are always running in your cluster (see kubectl get pods -n kube-system).
You can, however, force node pools down to zero, as you pointed out, with a script calling:
gcloud container clusters resize $CLUSTER --num-nodes=0 [--node-pool=$POOL]
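A minimal sketch of that approach (cluster name and zone are placeholders, and it assumes every pool gets the same daytime size; in practice you would record each pool's original size before scaling down):
#!/usr/bin/env bash
# Sketch: scale every node pool of a cluster to a given size, e.g. 0 in the
# evening and the daytime size in the morning, driven by two cron entries.
set -euo pipefail

CLUSTER="your-cluster"
ZONE="europe-west1-b"
SIZE="${1:?usage: $0 <size>}"

for POOL in $(gcloud container node-pools list \
                --cluster "$CLUSTER" --zone "$ZONE" \
                --format="value(name)"); do
  gcloud container clusters resize "$CLUSTER" \
    --zone "$ZONE" \
    --node-pool "$POOL" \
    --num-nodes "$SIZE" \
    --quiet
done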
I'm new to OpenShift, so I'm a bit confused about whether we can run user pods on master or infra nodes. We have two worker nodes plus one master node and one infra node, making four nodes in total. The reason for the change is to spread the load across all four nodes rather than just the two compute nodes.
From reading some documents it seems possible to assign two roles to one node, but is there any security risk, or is it simply not best practice?
We are running OpenShift version v3.11.0+d699176-406.
if we can run user pods on master or infra nodes
Yes, you absolutely can. The easiest way is to configure that at installation time; see https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-using-native-ha-ai for an example.
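As a rough illustration (a sketch only: host names are placeholders, and 'node-config-all-in-one' is one of the default 3.11 node groups), the inventory [nodes] section could assign a combined role to a host; other combinations, such as infra plus compute, can be defined through the openshift_node_groups variable:
# Sketch of an inventory [nodes] section -- host names are placeholders.
# 'node-config-all-in-one' carries the master, infra and compute roles,
# so user pods can also be scheduled on that host.
[nodes]
master.example.com  openshift_node_group_name='node-config-all-in-one'
infra.example.com   openshift_node_group_name='node-config-infra'
node1.example.com   openshift_node_group_name='node-config-compute'
node2.example.com   openshift_node_group_name='node-config-compute'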
is there any security risk or is it not best practice
Running a single master node or a single infra node is already a risk to the high availability of your cluster. If the master fails, your cluster is essentially headless; if the infra node fails, you lose your internal registry and routers, and with them external access and the ability to build new images for your imagestreams. The same applies to host OS upgrades: you will have to reboot the master and infra nodes some day. Are you okay with a guaranteed downtime during patching? What if something goes wrong during the update?
Regarding running user workloads on master and infra nodes: if you are not granting privileged SCCs (which can allow privileged pods, arbitrary UIDs on the host, etc.), you are somewhat safe from a container breakout, assuming there are no known bugs in the container engine you are using. However, you should pay close attention to resource consumption and avoid running any workloads without CPU and memory limits (see the LimitRange sketch below), because overloading a master node may degrade the whole cluster. You should also monitor disk usage, since running user pods loads more images into your Docker storage.
So basically it boils down to this:
It is better to have multiple masters (ideally three) and a couple of infra nodes than a single point of failure for both of these roles. Having separate master and worker nodes is of course better than hosting them together, but a multi-master setup, even with user workloads, should be more resilient as long as you watch resource usage carefully.
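To back the advice above about limits, here is a hedged sketch of a per-project LimitRange (name, namespace, and values are placeholders) that applies default requests and limits to containers that do not set their own; it can be created with oc create -f in each project that runs user workloads:
# Sketch: default CPU/memory requests and limits for a project, so user pods
# landing on master/infra nodes cannot consume unbounded resources.
# Name, namespace, and values are placeholders.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-project
spec:
  limits:
  - type: Container
    default:            # used as the limit when a container sets none
      cpu: 500m
      memory: 512Mi
    defaultRequest:     # used as the request when a container sets none
      cpu: 100m
      memory: 128Mi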
For the past two years, I've been testing the performance of our software in the Google Compute Engine (GCE), using up to a total of 1k vCPUs for the worker-VMs and the VMs for MySQL NDB Cluster that we use.
I have automated the creation of the worker VMs (which run our software) with the help of templates and instance groups, but for the MySQL NDB cluster I've always had two fixed machines, which I occasionally resized manually (i.e. changed the vCPUs and RAM).
I am now trying to bring up a variable number of VMs for the NDB cluster with different amounts of vCPUs and RAM, and I'd like to automate this process in my test suite. The main problem is that the NDB config expects fixed HostNames. With GCE templates and instance groups, I would only be able to bring up dynamically named instances like ndb-1ffrd, ndb-i8fhh, ....
To exemplify, here is my current ndb config (one of many) that involves fixed VMs ndb1 and ndb2:
[ndbd]
HostName=ndb1
NodeId=1
NodeGroup=0
[ndbd]
HostName=ndb2
NodeId=2
NodeGroup=0
[ndbd]
HostName=ndb1
NodeId=3
NodeGroup=1
[ndbd]
HostName=ndb2
NodeId=4
NodeGroup=1
I'd like to convert the fixed VMs ndb1/ndb2 into GCE instance groups where I can bring up an arbitrary number of such instances (typically 2 or 4, though) for a test and destroy the VMs afterwards, and do this on demand in an automated fashion during my tests. The reasoning behind this is to have repeatable tests with differently configured VMs. Changing many parameters manually over several tests makes it a nightmare to figure out what the exact configuration was 10 tests ago; this way each test would refer to a specific instance template for the ndb VMs.
However, GCE instance group members have a random suffix in their name and the NDB config expects fixed HostNames or IPs. So I'd need to either:
Have GCE generate instances from instance groups named in a deterministic way (e.g. ndb1, ndb2, ndb3, ...), so that I can rely on those names in my ndb configs, or
Somehow allow arbitrary hosts (or hosts with arbitrary suffixes) to connect as ndb nodes but still make sure that the same host isn't added to the same NodeGroup more than once -- something that is manually ensured in the above sample config.
Is there any way to achieve what I'm trying to do?
I think this can be achieved with scripts built around the gcloud SDK. gcloud lets you launch resources in GCP, and it has many options/flags that may help you set up the configuration you need.
Hope this helps.
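To make that concrete, here is a minimal sketch (template name, zone, and host naming are placeholders, and the generated output only covers the [ndbd] sections): instead of a managed instance group, gcloud compute instances create accepts --source-instance-template, so the instances can keep fixed, predictable names that the NDB config can rely on:
#!/usr/bin/env bash
# Sketch: bring up N deterministically named NDB data nodes from an instance
# template and emit matching [ndbd] sections. Template, zone, and naming are
# placeholders -- adjust to your environment.
set -euo pipefail

TEMPLATE="ndb-template"      # hypothetical instance template
ZONE="europe-west1-b"
NUM_NODES="${1:-2}"          # number of ndbd hosts to create

NAMES=()
for i in $(seq 1 "$NUM_NODES"); do
  NAMES+=("ndb${i}")
done

# Fixed names (ndb1, ndb2, ...) instead of the random suffixes that a
# managed instance group would generate.
gcloud compute instances create "${NAMES[@]}" \
  --zone "$ZONE" \
  --source-instance-template "$TEMPLATE"

# Emit [ndbd] sections: one replica per host in each node group, mirroring
# the layout of the sample config above.
nodeid=1
for group in 0 1; do
  for name in "${NAMES[@]}"; do
    printf '[ndbd]\nHostName=%s\nNodeId=%d\nNodeGroup=%d\n' \
      "$name" "$nodeid" "$group"
    nodeid=$((nodeid + 1))
  done
done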
I have a cluster that can only run one Pod per node due to our configuration (sometimes Kubernetes will randomly schedule two on one node, but whatever). Any time I update my Deployment, which triggers a rolling update, Kubernetes simply never finishes the update.
The reason appears to be that there isn't enough room on the nodes to deploy the new pods created by the rolling update.
Now, some of you may say that I could simply increase the cluster size every time I want to perform an update. The problem with that approach is that I have enabled autoscaling on the cluster and the Deployment's replica count is set high so that Kubernetes scales automatically with the cluster. This means I can't change the cluster size to accommodate the rolling update.
How can I perform a Rolling Update with this configuration?
Can you set maxSurge to 0 and maxUnavailable to some positive value?
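For example, a sketch of what that could look like in the Deployment (name, labels, image, and replica count are placeholders); with maxSurge: 0 the rollout first removes an old Pod, freeing its node for the replacement:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # placeholder name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0              # never create more Pods than 'replicas'
      maxUnavailable: 1        # take one old Pod down first, freeing its node
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:1.0      # placeholder image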
I have an AWS t2.micro EC2 instance with Docker on it, and I bring up the following containers:
jwilder/nginx-proxy
mysql
wordpress
Which results in docker stats output something like this:
CONTAINER    MEM USAGE / LIMIT      MEM %
wordpress    331.9 MB / 1.045 GB    31.77%
nginx        18.32 MB / 1.045 GB    1.75%
mysql        172.1 MB / 1.045 GB    16.48%
Then I run siege with its default 15 concurrent connections against it. This spawns multiple Apache processes, which hit the memory limit of the EC2 instance and crash Docker and bash because no memory is left, requiring my intervention to get everything running again.
I have a couple of questions regarding this.
Am I expecting too much? Should this setup be able to handle 15 concurrent connections? If so, what changes* need to be made?
How can I automate recovery from this? Is there a way to detect that memory is reaching capacity and do something (like reject requests or similar) until memory usage decreases? Is there a way to keep the system stable during the high request volume so once it's over it does not require my intervention to bring it back up?
* I've already done this to drop mysql memory from 22% to 15%.
Given a t2.micro only has 1GB total, and each of those containers has a 1GB limit on its own, have you tried limiting the max memory usage on each container (as per http://docs.docker.com/engine/reference/run/#user-memory-constraints) such that the total memory limit doesn't exceed 1GB?
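For example, a sketch with placeholder limits, passwords, and options (tune the numbers to your workload) where the per-container caps add up to well under 1 GB:
# Sketch: per-container memory caps (-m) sized so that their sum stays under
# the t2.micro's 1 GB of RAM. Limits, passwords, and options are placeholders.
docker run -d --name nginx-proxy -m 64m \
  -p 80:80 -v /var/run/docker.sock:/tmp/docker.sock:ro \
  jwilder/nginx-proxy

docker run -d --name mysql -m 256m \
  -e MYSQL_ROOT_PASSWORD=changeme \
  mysql

docker run -d --name wordpress -m 512m \
  -e VIRTUAL_HOST=example.com \
  -e WORDPRESS_DB_HOST=mysql -e WORDPRESS_DB_PASSWORD=changeme \
  --link mysql:mysql \
  wordpress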
The biggest impact, which stopped the EC2 instance from falling over, was limiting the memory a Docker container can use with the -m option, per @palfrey's answer.
Some additional tweaks were required to reduce the memory footprint and have the service respond to 15 concurrent users, albeit somewhat slowly. These included:
MySQL
Disabling performance_schema
Using a minimal config (a sketch follows after this list)
WordPress
Disabling KeepAlive
Limiting servers:
<IfModule mpm_prefork_module>
StartServers 1
MinSpareServers 1
MaxSpareServers 3
MaxRequestWorkers 10
MaxConnectionsPerChild 3000
</IfModule>
Docker
I created some docker images that extend the default images to include these optimisations:
mysql-minimal
wordpress-minimal
Further details in my blog post.
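As a rough illustration of the MySQL side (a sketch only; the values are illustrative, not the exact settings from the blog post), a minimal low-memory config might look like:
# Sketch of a low-memory my.cnf fragment; values are placeholders.
[mysqld]
performance_schema = 0        # the single biggest memory saving here
innodb_buffer_pool_size = 64M
key_buffer_size = 8M
max_connections = 30
tmp_table_size = 8M
max_heap_table_size = 8M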
Probably; a t2.micro only has 1 GB of RAM. You can run this configuration without Docker just fine, but you do have to adjust for the memory limitations. Docker probably adds some overhead. Is there a reason for running both nginx and Apache?
Generally you test and limit your threads to what the system can handle, and there are probably things you can do with caching that will improve performance. Apache, nginx, and php-fpm all have settings that control how many threads or processes may be created.
How do I automatically restart a preemptible Google Compute Engine instance? I only have one instance that doesn't need 100% uptime but that I would like to restart once the data center becomes unloaded again. The instance/server that I'm trying to automatically restart has its own boot disk that I'd like to use each time it restarts.
You could try using Instance Group Manager to set up a pool of size 1. It will then try to re-create instances after they are preempted.
You should be aware that there is no guarantee that there is going to be capacity for your instance. As the docs say:
Preemptible instances are available from a finite amount of Compute Engine resources, and might not always be available.
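For example (a sketch; names, zone, machine type, and boot-disk options are placeholders):
# Sketch: a single-instance managed instance group that re-creates the
# preemptible VM after it is preempted. Names, zone, and machine type are
# placeholders; add your image/boot-disk options to the template as needed.
gcloud compute instance-templates create my-preemptible-template \
  --machine-type n1-standard-1 \
  --preemptible

gcloud compute instance-groups managed create my-preemptible-group \
  --zone us-central1-a \
  --template my-preemptible-template \
  --size 1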
You could create an f1-micro instance, which is free for one instance per month in several data centers, and add a cron job:
*/10 * * * * /snap/bin/gcloud beta compute instances start --zone "yourzone" "yourinstance" --project "yourproject"
after you have run gcloud auth login once.
This will try to start your instance every 10 minutes. Of course you can also set this to an hour or more. With a bit more scripting, things like exponential back-off can be added as well; a sketch follows.
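A minimal sketch of that idea (zone, instance, and project are placeholders, and the retry limits are arbitrary):
#!/usr/bin/env bash
# Sketch: keep trying to start the instance, doubling the wait between
# attempts up to a cap. Zone/instance/project names are placeholders.
delay=60          # initial wait in seconds
max_delay=3600    # cap the back-off at one hour

until gcloud compute instances start "yourinstance" \
        --zone "yourzone" --project "yourproject"; do
  echo "Start failed, retrying in ${delay}s..."
  sleep "$delay"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt "$max_delay" ]; then
    delay=$max_delay
  fi
done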
If you'd like to restart it less frequently, you can use instance schedules, which are built into the Google Cloud dashboard.
https://cloud.google.com/compute/docs/instances/schedule-instance-start-stop
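For example (a sketch; the schedule name, region, zone, times, and instance name are placeholders), the same schedule can be created and attached from the command line:
# Sketch: create a start/stop schedule and attach it to an instance.
# Name, region/zone, times, and instance are placeholders.
gcloud compute resource-policies create instance-schedule nightly-schedule \
  --region us-central1 \
  --vm-start-schedule "0 8 * * *" \
  --vm-stop-schedule "0 20 * * *" \
  --timezone "UTC"

gcloud compute instances add-resource-policies yourinstance \
  --zone us-central1-a \
  --resource-policies nightly-schedule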