How to resolve random invalid genesis block in ST 1.1.2? - hyperledger-sawtooth

I installed ST 1.1.2 with docker and I get inconsistent results. Sometimes the genesis block is generated just fine and I can submit new batches/blocks. On the other hand, sometimes the devmode processor doesn't process the genesis block because of "failed" consensus data.
I ran this on several AWS Ubuntu 16.04 instances, with the same random behavior. I also installed 1.1.2 natively with Ubuntu but got the same problem, only consistently.
sudo docker-compose -f sawtooth-default.yaml up
devmode_engine_rust: | Received message: BlockNew(Block(block_num: 0 ...*
devmode_engine_rust: | Checking consensus data: Block(block_num: 0 ...*
devmode_engine_rust: | Failed consensus check: Block(block_num: 0 ...*
devmode_engine_rust: | Failing block [86, ...*
sudo docker-compose -f sawtooth-defualt.yaml down
sudo docker-compose -f sawtooth-defualt.yaml up
I expect the genesis block to be validated everythime, instead of it happening around 50% of the time, even though I don't change anything in how I start the docker containers. What do I need to do in order for the devmode consensus process to always accept the genesis block at the first try?

It seems like a timing issue. Is there a dependency on the validator container from the consensus container. For example,
devmode-engine:
image: hyperledger/sawtooth-devmode-engine-rust:1.1
ports:
- '5050:5050'
container_name: sawtooth-devmode-engine-rust-default
depends_on:
- validator
entrypoint: devmode-engine-rust --connect tcp://validator:5050
(From
https://github.com/danintel/sawtooth-cookiejar/blob/master/docker-compose.yaml )
Also, does the validator container run all the initialization commands (sawadm, sawtooth keygen, sawset genesis, sawadm genesis) before sawtooth-validator starts?

Related

mkdir /.gitlab-runner: permission denied running GitLab Runner in Kubernetes deployed via Helm

I'm trying to deploy the GitLab Runner (15.7.1) onto an on-premise Kubernetes cluster and getting the following error:
PANIC: loading system ID file: saving system ID state file: creating directory: mkdir /.gitlab-runner: permission denied
This is occurring with both the 15.7.1 image (Ubuntu?) and the alpine3.13-v15.7.1 image. Looking at the deployment, it looks likes it should be trying to use /home/gitlab-runner, but for some reason it is trying to use root (/), which is a protected directory.
Anyone else experience this issue or have a suggestion as to what to look at?
I am using the Helm chart (0.48.0) using a copy of the images from dockerhub (simply moved into a local repository as internet access is not available from the cluster). Connectivity to GitLab appears to be working, but the error causes the overall startup to fail. Full logs are:
Registration attempt 4 of 30
Runtime platform arch=amd64 os=linux pid=33 revision=6d480948 version=15.7.1
WARNING: Running in user-mode.
WARNING: The user-mode requires you to manually start builds processing:
WARNING: $ gitlab-runner run
WARNING: Use sudo for system-mode:
WARNING: $ sudo gitlab-runner...
Created missing unique system ID system_id=r_Of5q3G0yFEVe
PANIC: loading system ID file: saving system ID state file: creating directory: mkdir /.gitlab-runner: permission denied
I have tried the 15.7.1 image, the alpine3.13-v15.7.1 image, and the gitlab-runner-ocp:amd64-v15.7.1 image and searched the values.yaml for anything relevant to the path. Looking at the deployment template, it appears that it ought to be using /home/gitlab-runner as the directory (instead of /) [though the docs suggested it was /home].
As for "what was I expecting", of course I was expecting that it would "just work" :)
So, resolved this (and other) issues with:
Updated helm deployment template to mount an empty volume at /.gitlab-runner
[separate issue] explicitly added builds_dir and environment [per gitlab-org/gitlab-runner#3511 (comment 114281106)].
These two steps appeared to be sufficient to get the Helm chart deployment working.
You can easily create and mount the emptyDir (in case you are creating gitlab-runner with kubernetes manifest *.yml file):
volumes:
- emptyDir: {}
name: gitlab-runner
volumeMounts:
- name: gitlab-runner
mountPath: /.gitlab-runner
-------------------- OR --------------------
volumeMounts:
- name: root-gitlab-runner
mountPath: /.gitlab-runner
volumes:
- name: root-gitlab-runner
emptyDir:
medium: "Memory"

Unable to connect worker node to master using K3S

I am trying to setup a K3S cluster for learning purposes but I am having trouble connecting the master node with agents. I have looked several tutorials and discussions on this but I can't find a solution. I know I am probably missing something obvious (due to my lack of knowledge), but still help would be much appreciated.
I am using two AWS t2.micro instances with default configuration.
When ssh into the master and installed K3S using
curl -sfL https://get.k3s.io | sh -s - --no-deploy traefik --write-kubeconfig-mode 644 --node-name k3s-master-01
with kubectl get nodes, I am able to see the master
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 13s v1.23.6+k3s1
So far it seems I am doing things right. From what I understand, I am supposed to configure the kubeconfig file. So, I accessed it by using
cat /etc/rancher/k3s/k3s.yaml
I copied the configuration file and the server info to match the private IP I took from AWS console, resulting in something like this
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: <lots_of_info>
server: https://<master_private_IP>:6443
name: default
contexts:
- context:
cluster: default
user: default
name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
user:
client-certificate-data: <my_certificate_data>
client-key-data: <my_key_data>
Then, I ran vi ~/.kube/config, and there I pasted the kubeconfig file
Finally, I grabbed the token with cat /var/lib/rancher/k3s/server/node-token, ssh into the other machine and then run the following
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-01 K3S_URL=https://<master_private_IP>:6443 K3S_TOKEN=<master_token> sh -
The output is
[INFO] Finding release for channel stable
[INFO] Using v1.23.6+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/sha256sum-amd64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/k3s
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s-agent.service
[INFO] systemd: Enabling k3s-agent unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s-agent.service → /etc/systemd/system/k3s-agent.service.
[INFO] systemd: Starting k3s-agent
By this output, it looks like I have created an agent. However, when I run kubectl get nodes in the master, I still get
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 12m v1.23.6+k3s1
What is the thing I was supposed to do in order to get the agent connected to the master? I am guess I am probably missing something simple, but I just can't seem to find the solution. I've read all the documentation but it is still not clear to me where I am making the mistake. I've tried saving the private master IP and token into the agent as environmental variables with export K3S_TOKEN=master_token and K3S_URL=master_private_IP and then simply running curl -sfL https://get.k3s.io | sh - but I still can't see the worker nodes when running kubectl get nodes
Any help would be appreciated.
It might be your VM instance firewall that prevents appropriate connection from your master to the worker node (and vice versa). Official rancher documentation advise to disable firewall for (Red Hat/CentOS) Enterprise Linux:
It is recommended to turn off firewalld:
systemctl disable firewalld --now
If enabled, it is required to disable nm-cloud-setup and reboot the node:
systemctl disable nm-cloud-setup.service nm-cloud-setup.timer reboot
If you are using Ubuntu on your VM's, there is a different firewall tool (ufw).
In my case, allowing 6443 and 443(not sure if required) port TCP connections worked fine.
Allow port 6443 and TCP connection in all of your cluster machines:
sudo ufw allow 6443/tcp
Then apply k3s installation script in your worker node(s):
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-1 K3S_URL=https://<k3s-master-1 IP>:6443 K3S_TOKEN=<k3s-master-1 TOKEN> sh -
This should work. If not, you can try adding additional allow rule for 443 tcp port as well.
A few options to check.
Check Journalctl for errors
journalctl -u k3s-agent.service -n 300 -xn
If using RaspberryPi for a worker node, make sure you have
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
as the very end of your /boot/cmdline.txt file. DO NOT PUT THIS VALUE ON A NEW LINE! Should just be appended to the end of the line.
If your master node(s) have self-signed certs, make sure you copy the master node's self signed cert to your worker node(s). In linux or raspberry pi copy cert to /usr/local/share/ca-certificates, then issue an
sudo update-ca-certificates
on the worker node
Don't forget to reboot the worker node after you make these changes!
Hope this helps someone!

Kubernetes google cloud composer with gitlab ci yaml file

I am working on the deployment of a gitlab CI pipeline to trigger a google cloud composer DAG
Below is the .yaml I wrote :
stages:
- deploy
deploy:
stage: deploy
image: google/cloud-sdk
script:
- apt-get update && apt-get --only-upgrade install kubectl google-cloud-sdk
- gcloud config set project $GCP_PROJECT_ID
- gsutil cp plugins/*.py ${PLUGINS_BUCKET}
- gsutil cp dags/*.py ${DAGS_BUCKET}
- kubectl get pods
- gcloud composer environments run ${COMPOSER_ENVIRONMENT} --location ${ENVIRONMENT_LOCATION} trigger_dag -- ${DAG_NAME}
Unfortunately, the execution of the pipleine fails with the error below :
$ gcloud config set project $GCP_PROJECT_ID
Updated property [core/project].
$ gsutil cp plugins/*.py ${PLUGINS_BUCKET}
Copying file://plugins/dataproc_custom_operators.py [Content-Type=text/x-python]...
/ [0 files][ 0.0 B/ 2.3 KiB]
/ [1 files][ 2.3 KiB/ 2.3 KiB]
Operation completed over 1 objects/2.3 KiB.
$ gsutil cp dags/*.py ${DAGS_BUCKET}
copying file://dags/frrm_infdeos_workflow.py [Content-Type=text/x-python]...
/ [0 files][ 0.0 B/ 3.3 KiB]
/ [1 files][ 3.3 KiB/ 3.3 KiB]
Operation completed over 1 objects/3.3 KiB.
$ gcloud composer environments run ${COMPOSER_ENVIRONMENT} --location ${ENVIRONMENT_LOCATION} trigger_dag -- ${DAG_NAME}
kubeconfig entry generated for europe-west1-nameenvironment-a5456e0c-gke.
ERROR: (gcloud.composer.environments.run) No running GKE pods found. If the environment was recently started, please wait and retry.
ERROR: Job failed: command terminated with exit code 1
Do you have any idea about how to fix this please ?
Best regards
I had the same problem as #scalacode. For me, the solution was that the gitlab-runner was running in a different GCP Project than the Composer Environment, so it failed without specifying that error. Running a gitlab-runner in the same project as the Composer Environment fixed the issue.
It seems Composer is unable to retrieve information about the pods/GKE cluster. This could be for a number of reasons ranging from the GKE cluster not creating the nodes to the pods being in a crash loop.
I notice in the script you did not “get-credentials” to authenticate to the cluster. When running commands on a GKE cluster through CLI, traditionally you would first have to authenticate to the cluster first with command. To do this with composer:
gcloud composer environments describe ${COMPOSER_ENVIRONMENT} --location ${ENVIRONMENT_LOCATION} --format="get(config.gkeCluster)"
This will return something of the form: projects/PROJECT/zones/ZONE/clusters/CLUSTER Then run:
gcloud container clusters get-credentials ${CLUSTER} --zone ${ZONE}
Once you have authenticated to the cluster in the script, see if it is now able to complete. If not, try running kubectl get pods to see what is happening with the pods/if they exist.
If you see many pods restarting or generally not in the “running/completed” state, the issue could be with the pod configuration.
If you don’t see pods at all, the deployment may have failed. Check the deployment with command kubectl get deployments.
The deployments airflow-scheduler, airflow-sqlproxy, & airflow-worker should be present. If those three deployments are not present, the environment was likely tampered with, & it would be easiest to make a new environment.

Scaling Up of GlusterFS-storage only add new peer without new bricks in Openshift

Observed behavior
I started with one node Openshift cluster and it successfully deployed master/node and gluster volume. Now I extend Openshift cluster and it was successfully.
but on extending glusterfs volume with below
[glusterfs]
10.1.1.1 glusterfs_devices='[ "/dev/vdb" ]'
10.1.1.2 glusterfs_devices='[ "/dev/vdb" ]' openshift_node_labels="type=upgrade"
ansible-playbook -i inventory2.ini /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml -e openshift_upgrade_nodes_label="type=upgrade"
it only added 10.1.1.2 as peer but volume still has only one brick
Following customization done to start deploy gluster from 1 node {--durability none}
openshift-ansible/roles/openshift_storage_glusterfs/tasks/heketi_init_db.yml
- name: Create heketi DB volume
command: "{{ glusterfs_heketi_client }} setup-openshift-heketi-storage --image {{ glusterfs_heketi_image }} --listfile /tmp/heketi-storage.json **--durability none**"
register: setup_storage
>gluster peer status
Number of Peers: 1
Hostname: 10.1.1.2
Uuid: 1b8159e4-99e2-4f4d-ad95-e97bc8655d32
State: Peer in Cluster (Connected)
gluster volume info
Volume Name: heketidbstorage
Type: Distribute
Volume ID: 769419b9-d28f-4cdd-a8f3-708b6b738f65
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.1.1.1:/var/lib/heketi/mounts/vg_4187bfa3eb090ceffea9c53b156ddbd4/brick_80401b43be8c3c8a74417b18ad574524/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
Expected/desired behavior
I am expecting that on addition of every new node it should create new brick too
Details on how to reproduce (minimal and precise)
Add nodes in gluster cluster with below commands
ansible-playbook -i inventory2.ini /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml -e openshift_upgrade_nodes_label="type=upgrade"
Information about the environment:
Heketi version used (e.g. v6.0.0 or master): OpenShift 3.10
Operating system used: CentOS
Heketi compiled from sources, as a package (rpm/deb), or container: Container
If container, which container image: docker.io/heketi/heketi:latest
Using kubernetes, openshift, or direct install: Openshift
If kubernetes/openshift, is gluster running inside kubernetes/openshift or outside: outside
If kubernetes/openshift, how was it deployed (gk-deploy, openshift-ansible, other, custom): openshift-ansible
Just adding a node/server does not mean that the brick will also be added to existing
gluster volume.
You have to add that brick, hosted on new node, to existing volume.
command -
"gluster volume add-brick host:brick-path commit force"
Not sure if you have provided this command in your automation script or not.

Openshift: Error pulling image from remote, secure docker registry using certificates

I use the all-in-one VM of Openshift origin.
I am trying to pull images from a private, secure registry using an Image Stream. This is the ImageStream definition:
apiVersion: v1
kind: ImageStream
metadata:
name: my-image-stream
annotations:
description: Keeps track of changes in the application image
name: my-image
spec:
dockerImageRepository: "my.registry.net/myproject/my-image"
The repository is secured with a certificate. On my local machine, i have them in /etc/docker/certs.d/my.registry.net and I can login with docker login my.registry.net.
When I run oc import-image, however, I get the following error:
The import completed with errors.
Name: my-image
Namespace: myproject
Created: About an hour ago
Labels: <none>
Description: Keeps track of changes in the application image
Annotations: openshift.io/image.dockerRepositoryCheck=2017-01-27T08:09:49Z
Docker Pull Spec: 172.30.53.244:5000/myproject/my-image
Unique Images: 0
Tags: 1
latest
tagged from my.registry.net/myproject/my-image
! error: Import failed (InternalError): Internal error occurred: Get https://my.registry.net/v2/: remote error: handshake failure
About an hour ago
I have copied the certificates to the vagrant machine and restarted the docker daemon, but the problem remains. I have not found any documentation on how to properly add the certificates, so I just put them in the usual docker folder.
What is the appropriate way to make this work?
Update in response to rezie's answer:
There is no file etc/origin/master/ca-bundle.crt on my vagrant box. I found the following ca-bundle.crt files :
$ find / -iname ca-bundle.crt
/etc/pki/tls/certs/ca-bundle.crt
##multiple lines like
/var/lib/docker/devicemapper/mnt/something-hash-like/rootfs/etc/pki/tls/certs/ca-bundle.crt
/var/lib/origin/openshift.local.config/master/ca-bundle.crt
I appended the root certificate to /etc/pki/tls/certs/ca-bundle.crt and to var/lib/origin/openshift.local.config/master/ca-bundle.crt, but that did not change anything.
Please note, however, that I do not need to have this root certificate in /etc/docker/certs.d/... in order to login directly using docker login my.registry.net
I have appended
I cannot comment due tow lo karma so I'll write an answer saying almost the same as rezie.
The error:
! error: Import failed (InternalError): Internal error occurred: Get https://my.registry.net/v2/: remote error: handshake failure
About an hour ago
Comes from OpenShift, not from docker, therefore adding it to /etc/docker/certs.d/my.registry.net doesn't prevent the error from happening.
You should add the CA certificate at OS level, my guess is the steps failed for some reason so do it this way:
openssl s_client -connect my.registry.net:443 </dev/null |
sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
> /etc/pki/ca-trust/source/anchors/my.registry.net.crt &&
update-ca-trust check && update-ca-trust extract
Finally test if it worked running
curl https://my.registry.net/v2
If it doesn't give you a certificate error and you still can't do the oc import restart the atomic-openshift-master-api service
Try appending your CA (the same one you said you said that was used in the my.registry.net directory) into Openshift's ca bundle (e.g. /etc/origin/master/ca-bundle.crt. Then restart the service and reattempt import-image (making sure that you do not include the --insecure flag).
For reference, check out this issue from the Origin project. As you've mentioned, there's currently no way to supply certificates along with the dockercfg secret, and the suggestion from that issue is to add the CA as a trusted root CA across all the hosts.