Trying to set vm.max_map_count with the Node Tuning Operator and the OpenShift ClusterLogging operator. OpenShift version is 4.9.17; the Cluster Logging and Elasticsearch operators are the latest versions.
This is my tuned configuration:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: common-services-es
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Optimize systems running ES on OpenShift nodes
      include=openshift-node
      [sysctl]
      vm.max_map_count=262144
    name: common-services-es
  recommend:
  - match:
    - label: component
      type: pod
      value: elasticsearch
    priority: 5
    profile: common-services-es
My ClusterLogging configuration is the operator default, and I can verify the label component=elasticsearch is present on the pods.
Getting the pod logs with the following command
for p in `oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned -o=jsonpath='{range .items[*]}{.metadata.name} {end}'`; do printf "\n*** $p ***\n" ; oc logs pod/$p -n openshift-cluster-node-tuning-operator | grep applied; done
returns tuned.daemon.daemon: static tuning from profile 'common-services-es' applied on all 3 of my ES nodes, but the elasticsearch pod still fails to start with the error max virtual memory areas vm.max_map_count [253832] is too low, increase to at least [262144], and running sysctl vm.max_map_count on the nodes confirms the value is still 253832.
It turns out that IBM Cloud OpenShift doesn't use MachineConfigs, and tuned uses MachineConfigs.
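For anyone hitting the same thing on a managed cluster where MachineConfigs aren't available: a common workaround (not the Node Tuning Operator's mechanism, and all names below are illustrative) is a small privileged DaemonSet that sets the sysctl on every node, for example:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-max-map-count
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: set-max-map-count
  template:
    metadata:
      labels:
        app: set-max-map-count
    spec:
      containers:
      - name: sysctl
        image: busybox
        securityContext:
          privileged: true   # vm.max_map_count is not namespaced, so this sets it node-wide
        command: ["sh", "-c", "sysctl -w vm.max_map_count=262144 && while true; do sleep 3600; done"]
On OpenShift the pod's service account also needs a privileged SCC, and you may want a nodeSelector to restrict this to the Elasticsearch nodes.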
I was following the tutorial on https://cloud.google.com/tpu/docs/how-to.
I created a TPU instance and tried to connect to it with the gcloud compute ssh command. Then this error occurred:
AppData\Local\Google\Cloud SDK>gcloud compute ssh node-1 --zone=asia-east1-c
ERROR: (gcloud.compute.ssh) Could not fetch resource:
- The resource 'projects/project-masker/zones/asia-east1-c/instances/node-1' was not found
Trying to solve this error, I found out that the TPUs were not included in the execution groups.
AppData\Local\Google\Cloud SDK>gcloud compute tpus list
NAME    ZONE          ACCELERATOR_TYPE  NETWORK  RANGE            STATUS
node-2 asia-east1-c v2-8 default 10.75.202.248/29 READY
node-1 asia-east1-c v2-8 default 10.82.81.168/29 READY
AppData\Local\Google\Cloud SDK>gcloud compute tpus execution-groups list
Listed 0 items.
This is what I got when I tried to restart the TPU.
Request issued for: [node-1]
Waiting for operation [projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b] to complete...done.
done: true
metadata:
  '@type': type.googleapis.com/google.cloud.common.OperationMetadata
  apiVersion: v1
  cancelRequested: false
  createTime: '2021-07-03T08:00:49.884674545Z'
  endTime: '2021-07-03T08:01:31.161199334Z'
  target: projects/project-masker/locations/asia-east1-c/nodes/node-1
  verb: update
name: projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b
response:
  '@type': type.googleapis.com/google.cloud.tpu.v1.Node
  acceleratorType: v2-8
  apiVersion: V1
  cidrBlock: 10.82.81.168/29
  createTime: '2021-07-03T07:27:41.148997156Z'
  health: HEALTHY
  ipAddress: 10.82.81.170
  name: projects/project-masker/locations/asia-east1-c/nodes/node-1
  network: global/networks/default
  networkEndpoints:
  - ipAddress: 10.82.81.170
    port: 8470
  port: '8470'
  schedulingConfig: {}
  serviceAccount: service-...@cloud-tpu.iam.gserviceaccount.com
  state: READY
  tensorflowVersion: pytorch-1.9
I tried to find some related articles on google, but I couldn't find any. How can I fix this?
You can't SSH to a TPU node directly, so gcloud compute ssh {tpu_name} isn't expected to work.
You can, however, SSH directly to a TPU VM; please see this link. If you are already using a TPU VM, then your issue is that you're trying
gcloud compute ssh
rather than
gcloud alpha compute tpus tpu-vm ssh ...
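For example, assuming node-1 is actually a TPU VM in the zone from the question, the working command would look like:
gcloud alpha compute tpus tpu-vm ssh node-1 --zone=asia-east1-c
If node-1 is a TPU Node (the older architecture), there is no VM to SSH into at all; you connect to it from a separate Compute Engine VM instead.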
Using https://github.com/helm/charts/tree/master/stable/mysql (all the code is here), it is cool being able to run MySQL as part of my local Kubernetes cluster (using Docker Kubernetes).
The problem though is that once I stop running the pod, and then run the pod again, all the data that was stored is now gone.
My question is: how do I keep the data that was added to the MySQL pod? I have read about persistent volumes, and the mysql Helm example from GitHub shows that it uses a PersistentVolumeClaim. I have also enabled persistence in the values.yaml file, but I still cannot see the data that was previously saved in the database.
My Docker Kubernetes version is currently 1.14.6.
Please verify your MySQL Pod. You should notice the volumes and volumeMounts options:
volumeMounts:
  - mountPath: /var/lib/mysql
    name: data
.
.
.
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: msq-mysql
In addition, please verify your PersistentVolume, PersistentVolumeClaim, and StorageClass:
kubectl get pv,pvc,pods,sc:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/pvc-2c6aa172-effd-11e9-beeb-42010a840083 8Gi RWO Delete Bound default/msq-mysql standard 24m
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/msq-mysql Bound pvc-2c6aa172-effd-11e9-beeb-42010a840083 8Gi RWO standard 24m
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/msq-mysql-b5c48c888-pz6p2 1/1 Running 0 4m28s 10.0.0.8 gke-te-1-default-pool-36546f4e-5rgw <none> <none>
Please run kubectl describe persistentvolumeclaim/msq-mysql (in your case you should change the PVC name).
You can notice that the PVC was provisioned successfully using gce-pd and is mounted by the msq-mysql Pod:
Normal ProvisioningSucceeded 26m persistentvolume-controller Successfully provisioned volume pvc-2c6aa172-effd-11e9-beeb-42010a840083 using kubernetes.io/gce-pd
Mounted By: msq-mysql-b5c48c888-pz6p2
I have created a table with one row, deleted the pod, and verified after that (as expected, everything is still there):
mysql> SELECT * FROM t;
+------+
| c |
+------+
| ala |
+------+
1 row in set (0.00 sec)
Why "all the data that was stored is now gone":
As per helm chart docs:
The MySQL image stores the MySQL data and configurations at the /var/lib/mysql path of the container.
By default a PersistentVolumeClaim is created and mounted into that directory. In order to disable this functionality you can change the values.yaml to disable persistence and use an emptyDir instead.
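For reference, a minimal sketch of the relevant persistence section in the chart's values.yaml (the values shown are illustrative; check the chart's own values.yaml for the defaults):
persistence:
  enabled: true
  ## optionally pin a StorageClass; omit it to use the cluster default
  # storageClass: standard
  accessMode: ReadWriteOnce
  size: 8Gi
  ## or reuse an existing PVC instead of letting the chart create one
  # existingClaim: msq-mysql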
Most often the problem is with PV/PVC binding. It can also be a problem with a user-defined or non-default StorageClass.
So please verify the PV and PVC as stated above.
Take a look at StorageClass
A claim can request a particular class by specifying the name of a StorageClass using the attribute storageClassName. Only PVs of the requested class, ones with the same storageClassName as the PVC, can be bound to the PVC.
PVCs don’t necessarily have to request a class. A PVC with its storageClassName set equal to "" is always interpreted to be requesting a PV with no class, so it can only be bound to PVs with no class (no annotation or one set equal to ""). A PVC with no storageClassName is not quite the same and is treated differently by the cluster, depending on whether the DefaultStorageClass admission plugin is turned on.
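As a minimal sketch (names are illustrative, matching the output above), a PVC that explicitly requests the standard class, and can therefore only bind to a PV of that class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: msq-mysql
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 8Gi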
Observed behavior
I started with a one-node OpenShift cluster, and it successfully deployed the master/node and the gluster volume. Then I extended the OpenShift cluster, and that was successful.
But on extending the glusterfs volume with the inventory and command below,
[glusterfs]
10.1.1.1 glusterfs_devices='[ "/dev/vdb" ]'
10.1.1.2 glusterfs_devices='[ "/dev/vdb" ]' openshift_node_labels="type=upgrade"
ansible-playbook -i inventory2.ini /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml -e openshift_upgrade_nodes_label="type=upgrade"
it only added 10.1.1.2 as a peer, but the volume still has only one brick.
The following customization was done to start deploying gluster from one node (--durability none):
openshift-ansible/roles/openshift_storage_glusterfs/tasks/heketi_init_db.yml
- name: Create heketi DB volume
  command: "{{ glusterfs_heketi_client }} setup-openshift-heketi-storage --image {{ glusterfs_heketi_image }} --listfile /tmp/heketi-storage.json --durability none"
  register: setup_storage
>gluster peer status
Number of Peers: 1
Hostname: 10.1.1.2
Uuid: 1b8159e4-99e2-4f4d-ad95-e97bc8655d32
State: Peer in Cluster (Connected)
gluster volume info
Volume Name: heketidbstorage
Type: Distribute
Volume ID: 769419b9-d28f-4cdd-a8f3-708b6b738f65
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.1.1.1:/var/lib/heketi/mounts/vg_4187bfa3eb090ceffea9c53b156ddbd4/brick_80401b43be8c3c8a74417b18ad574524/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
Expected/desired behavior
I am expecting that on the addition of every new node, a new brick should be created too.
Details on how to reproduce (minimal and precise)
Add nodes to the gluster cluster with the below command:
ansible-playbook -i inventory2.ini /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml -e openshift_upgrade_nodes_label="type=upgrade"
Information about the environment:
Heketi version used (e.g. v6.0.0 or master): OpenShift 3.10
Operating system used: CentOS
Heketi compiled from sources, as a package (rpm/deb), or container: Container
If container, which container image: docker.io/heketi/heketi:latest
Using kubernetes, openshift, or direct install: Openshift
If kubernetes/openshift, is gluster running inside kubernetes/openshift or outside: outside
If kubernetes/openshift, how was it deployed (gk-deploy, openshift-ansible, other, custom): openshift-ansible
Just adding a node/server does not mean that a brick will also be added to the existing gluster volume.
You have to add that brick, hosted on the new node, to the existing volume.
The command is:
gluster volume add-brick <volname> <host>:<brick-path> force
Not sure if you have provided this command in your automation script or not.
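For example, with the volume from the output above (the brick path on 10.1.1.2 is a placeholder; use whatever path heketi/LVM actually created on that node):
gluster peer probe 10.1.1.2
gluster volume add-brick heketidbstorage 10.1.1.2:/var/lib/heketi/mounts/<vg>/<brick>/brick force
gluster volume rebalance heketidbstorage start   # optional, spreads existing data onto the new brick
gluster volume info heketidbstorage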
I have set up my Kubernetes 1.3.4 cluster on GCE with
export KUBE_ENABLE_CLUSTER_MONITORING=google
This works quite nicely: I get application logs (for some reason in the Container Engine section, but well) and also pod and node metrics.
The only thing that is missing is the node memory metrics; only CPU is shown (see screenshot).
No memory metrics
In the heapster logs I see tons of lines like this
{
  metadata: {
    severity: "ERROR"
    projectId: "<project-id>"
    serviceName: "container.googleapis.com"
    zone: "europe-west1-d"
    labels: {
      container.googleapis.com/cluster_name: "production"
      compute.googleapis.com/resource_type: "instance"
      compute.googleapis.com/resource_name: "fluentd-cloud-logging-production-minion-group-p0w8"
      container.googleapis.com/instance_id: "6772154497331326454"
      container.googleapis.com/pod_name: "heapster-v1.1.0-2102007506-23b3e"
      compute.googleapis.com/resource_id: "6772154497331326454"
      container.googleapis.com/stream: "stderr"
      container.googleapis.com/namespace_name: "kube-system"
      container.googleapis.com/container_name: "heapster"
    }
    timestamp: "2016-09-13T14:40:08.000Z"
    projectNumber: "930564692351"
  }
  textPayload: "E0913 14:40:08.665035    1 gcm.go:179] Error while sending request to GCM googleapi: Error 400: Timeseries 76, point: start is not older than end, for a cumulative metric, invalidParameter"
  insertId: "pt5bo7g132r266"
  log: "heapster"
}
Not sure if this is related.
Any ideas?
If you are running your cluster on GCE instead of GKE, you should install the Stackdriver agent and verify the credentials the agent is using to communicate with Stackdriver.
If you are using Linux, you can install the agent by executing:
curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh
and you can check your credentials by running the following commands:
sudo cat $GOOGLE_APPLICATION_CREDENTIALS
sudo cat /etc/google/auth/application_default_credentials.json
I currently try to switch from the "Container-Optimized Google Compute Engine Images" (https://cloud.google.com/compute/docs/containers/container_vms) to the "Container-VM" Image (https://cloud.google.com/compute/docs/containers/vm-image/#overview). In my containers.yaml, I define a volume and a container using the volume.
apiVersion: v1
kind: Pod
metadata:
  name: workhorse
spec:
  containers:
    - name: postgres
      image: postgres:9.5
      imagePullPolicy: Always
      volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: postgres-storage
      gcePersistentDisk:
        pdName: disk-name
        fsType: ext4
This setup worked fine with the "Container-Optimized Google Compute Engine Images", however it fails with the "Container-VM" image. In the logs, I can see the following error:
May 24 18:33:43 battleship kubelet[629]: E0524 18:33:43.405470 629 gce_util.go:176]
Error getting GCECloudProvider while detaching PD "disk-name":
Failed to get GCE Cloud Provider. plugin.host.GetCloudProvider returned <nil> instead
Thanks in advance for any hint!
This happens only when kubelet is run without the --cloud-provider=gce flag. The problem, unless it is something different, depends on how GCP is launching Container-VMs.
Please get in touch with the Google Cloud Platform folks.
Note, if this happens to you when using GCE: add the --cloud-provider=gce flag to kubelet on all your workers. This only applies to 1.2 cluster versions because, if I'm not wrong, there is an ongoing attach/detach redesign targeted for 1.3 clusters which will move this business logic out of kubelet.
In case someone is interested in the attach/detach redesign here it is its corresponding github issue: https://github.com/kubernetes/kubernetes/issues/20262
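As an illustrative sketch only (the exact file and variable name depend on how your 1.2 workers were provisioned; the paths below are assumptions, not canonical locations):
# On each worker, append the flag to the kubelet arguments, e.g. in
# /etc/default/kubelet or in the systemd drop-in your install uses:
#   KUBELET_OPTS="... --cloud-provider=gce"
sudo systemctl restart kubelet
# Verify the running process picked up the flag:
ps aux | grep '[k]ubelet' | grep -- '--cloud-provider'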