Openshift: Why can't I add a node to a district?

I'm trying to add a node to a district:
[root@broker ~]# oo-admin-ctl-district -c add-node -n small_district -i node1.example.com
ERROR OUTPUT:
Node with server identity: node1.example.com is of node profile '' and needs to be 'small' to add to district 'small_district'
But, when I go to the node, it seems to know that it should be a small:
[root@node1 ~]# grep -i profile /etc/mcollective/facts.yaml
node_profile: small
I ran oo-diagnostics on the broker and got:
[root@broker ~]# oo-diagnostics
FAIL: test_node_profiles_districts_from_broker
No node hosts found. Please install some,
or ensure the existing ones respond to 'mco ping'.
OpenShift cannot host gears without at least one node host responding.
FAIL: run_script
oo-accept-systems -w 2 had errors:
--BEGIN OUTPUT--
FAIL: No node hosts responded. Run 'mco ping' and troubleshoot if this is unexpected.
1 ERRORS
But mco ping shows no problems:
[root@broker ~]# mco ping
node1.example.com time=106.82 ms
---- ping statistics ----
1 replies max: 106.82 min: 106.82 avg: 106.82
I also found https://lists.openshift.redhat.com/openshift-archives/users/2013-November/msg00006.html, which lists the same error message. However, I already have everything in /etc/mcollective/facts.yaml that the thread suggests:
[root@node1 ~]# grep 'node_profile' /etc/mcollective/facts.yaml
node_profile: small
What could be preventing the node from being added to the district?

The problem was a misconfiguration of the node. https://bugzilla.redhat.com/show_bug.cgi?id=1064977 should resolve the documentation issue that led to this.
The resolution was to update /etc/mcollective/server.cfg:
[root@node1 ~]# git diff --color /etc/mcollective/server.cfg.old /etc/mcollective/server.cfg
diff --git a/etc/mcollective/server.cfg.old b/etc/mcollective/server.cfg
index c614ed9..fff36c5 100644
--- a/etc/mcollective/server.cfg.old
+++ b/etc/mcollective/server.cfg
@@ -22,4 +22,5 @@ plugin.activemq.pool.1.password = marionette
# Facts
factsource = yaml
-plugin.yaml = /opt/rh/ruby193/root/etc/mcollective/facts.yaml
+plugin.yaml = /etc/mcollective/facts.yaml
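After a change like this, MCollective on the node has to be restarted so it re-reads server.cfg, and the broker should then see the profile. A minimal verification sketch (the service name is an assumption; on OpenShift Enterprise 2 nodes it is typically ruby193-mcollective):
[root@node1 ~]# service ruby193-mcollective restart    # or: service mcollective restart
[root@broker ~]# mco facts node_profile                # node1 should now report 'small'
[root@broker ~]# oo-admin-ctl-district -c add-node -n small_district -i node1.example.com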

Related

Openshift 4.4 - cannot 'oc logs/exec' pods running on worker nodes

Openshift 4.4.17 cluster (3 masters and 3 workers).
I get an error when trying to see logs (or exec a terminal) on pods running on worker nodes. The same applies to the OpenShift GUI. There are no issues when doing the same for pods running on master nodes.
Example 1: pods running on worker
$ oc whoami
kube:admin
$ oc get pod -n lamp
NAME READY STATUS RESTARTS AGE
lamp-lamp-6c7d9f467d-jsn4t 3/3 Running 0 108d
$ oc logs lamp-lamp-6c7d9f467d-jsn4t httpd -n lamp
error: You must be logged in to the server (the server has asked for the client to provide credentials ( pods/log lamp-lamp-6c7d9f467d-jsn4t))
Example 2: pods running on master nodes
$ oc get pod -n openshift-apiserver
NAME READY STATUS RESTARTS AGE
apiserver-6d64545f-5lmb8 1/1 Running 0 2d19h
apiserver-6d64545f-hktqd 1/1 Running 0 2d19h
apiserver-6d64545f-kb4qt 1/1 Running 0 2d19h
$ oc logs apiserver-6d64545f-5lmb8 -n openshift-apiserver
Copying system trust bundle
I0225 20:41:39.989689 1 requestheader_controller.go:243] (..output omitted..)
Investigating kubelet on worker nodes:
On every worker node the kubelet service is running, but
journalctl -u kubelet
shows these two lines:
Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
logging error output: "Unauthorized"
About kubeconfig on worker nodes:
Looking at the content of the /etc/kubernetes/kubeconfig file:
- kubelet connects to the api-server --> https://api-int.ocs-cls1.mycompany.lab
- the server presents a valid certificate signed by --> kube-apiserver-lb-signer
- certificate-authority-data carries --> the kube-apiserver-lb-signer rootCA
The kubeconfig looks correct.
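One way to double-check the CA embedded in that file is to decode and inspect it (a sketch using the standard kubeconfig fields and openssl; if the field holds a bundle, openssl x509 only prints the first certificate):
$ grep certificate-authority-data /etc/kubernetes/kubeconfig | awk '{print $2}' \
    | base64 -d | openssl x509 -noout -subject -issuer -dates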
UPDATE:
Also noticed these log lines reporting bad certificate:
$ oc -n openshift-apiserver logs apiserver-6d64545f-5lmb8
log.go:172] http: TLS handshake error from 10.128.0.12:47078: remote error: tls: bad certificate
...
UPDATE2:
Also checked apiserver-loopback-client certificate:
$ curl --resolve apiserver-loopback-client:6443:{IP_MASTER} -v -k https://apiserver-loopback-client:6443/healthz
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* common name: apiserver-loopback-client#1614330374 (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: CN=apiserver-loopback-client#1614330374
* start date: Fri, 26 Feb 2021 08:06:13 GMT
* expire date: Sat, 26 Feb 2022 08:06:13 GMT
* issuer: CN=apiserver-loopback-client-ca#1614330374
Try this:
while :;do
sleep 2
oc get csr -o name | xargs -r oc adm certificate approve
done
In another terminal, ssh to any master node and run this:
crictl ps -a | awk '/Running/&&/-cert-syncer/{print $1}' | xargs -r crictl stop
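Afterwards it is worth confirming that no CSRs remain pending and that logs from worker pods work again (a verification sketch, not part of the original answer):
$ oc get csr | grep -i pending                       # should eventually return nothing
$ oc logs lamp-lamp-6c7d9f467d-jsn4t httpd -n lamp   # should stream logs instead of 'Unauthorized'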

Orion Context Broker functional test failure

I have successfully forked and built the Context Broker source code on a CentOS 6.9 VM and now I am trying to run the functional tests as the official documentation suggests. First, I installed the accumulator-server.py script:
$ make install_scripts INSTALL_DIR=~
Verified that it is installed:
$ accumulator-server.py -u
Usage: accumulator-server.py --host <host> --port <port> --url <server url> --pretty-print -v -u
Parameters:
--host <host>: host to use database to use (default is '0.0.0.0')
--port <port>: port to use (default is 1028)
--url <server url>: server URL to use (default is /accumulate)
--pretty-print: pretty print mode
--https: start in https
--key: key file (only used if https is enabled)
--cert: cert file (only used if https is enabled)
-v: verbose mode
-u: print this usage message
And then I ran the functional tests:
$ make functional_test INSTALL_DIR=~
But the test fails and exits with the message below:
024/927: 0000_ipv6_support/ipv4_ipv6_both.test ........................................................................ (FAIL 11 - SHELL-INIT exited with code 1) testHarness.sh/IPv6 IPv4 Both : (0000_ipv6_support/ipv4_ipv6_both.test)
make: *** [functional_test] Error 11
$
I checked the file ../0000_ipv6_support/ipv4_ipv6_both.shellInit.stdout for any hint on what may be going wrong, but the error log does not lead me anywhere:
{ "dropped" : "ftest", "ok" : 1 }
accumulator running as PID 6404
Unable to start listening application after waiting 30
Does anyone have any idea about what may be going wrong here?
I checked the script which prints the error line Unable to start listening application after waiting 30 and noticed that stderr for accumulator-server.py is logged into the /tmp folder.
The accumulator_9977_stderr file had this log: 0000_ipv6_support/ipv4_ipv6_both.shellInit: line 27: accumulator-server.py: command not found
Once I saw this log I understood the mistake I made: I was running the functional tests with sudo, so the secure_path was being used instead of my PATH variable.
So, in the end, running the functional tests with the command below solved the issue for me.
$ sudo "PATH=$PATH" make functional_test INSTALL_DIR=~
This can also be solved by editing the /etc/sudoers file with:
$ sudo visudo
and modifying the secure_path value.
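For illustration, the edited line could look roughly like this (a sketch; /home/youruser/bin stands in for wherever INSTALL_DIR=~ placed accumulator-server.py):
Defaults    secure_path="/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/home/youruser/bin"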

net-snmp ubuntu - snmptrapd doesn't log in mysql

ISSUE: net-snmp does not log traps into the MySQL database. Installed on Ubuntu.
Net-snmp was configured with the following, as per the tutorial: http://www.net-snmp.org/wiki/index.php/Net-Snmp_on_Ubuntu
I configured snmptrapd as described on the following page:
http://www.net-snmp.org/wiki/index.php/Snmptrapd
My MySQL installation was running with no issues; however, it did not contain the mysql_config file, so I ran the following install to get it:
sudo apt-get install libmysqlclient-dev
MySQL continues to run with no issues.
The net-snmp configuration was run successfully with the following command:
./configure --with-defaults --with-mysql
and the configure output showed that MySQL logging was enabled.
cat snmptrapd.conf ---------------
authCommunity log public
# maximum number of traps to queue before forced flush
# set to 1 to immediately write to the database
sqlMaxQueue 1
# seconds between periodic queue flushes
sqlSaveInterval 1
cat snmpd.conf - contains as its line1 & line 2 -------------------
rwcommunity public localhost
linux@lin-850:~$ cat my.cnf
[snmptrapd]
user=root
password=qbcdfee
host=localhost
The following command runs well with appropriate output
snmpwalk -v 1 -c public localhost
The DB schema was created as per /net-snmp-5.7.3/dist/schema-snmptrapd.sql.
Where did I go wrong? Please help. Thanks in advance.
regs
George
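One way to narrow this down is to send a test trap locally and then look for it in the database; a minimal sketch, assuming the schema from dist/schema-snmptrapd.sql was loaded as shipped (a net_snmp database with a notifications table):
# send a test trap to the local snmptrapd (numeric OID of netSnmpExampleHeartbeatNotification)
snmptrap -v 2c -c public localhost '' 1.3.6.1.4.1.8072.2.3.0.1
# then check whether it was written to MySQL
mysql -u root -p -e 'SELECT * FROM net_snmp.notifications;'
If snmptrapd logs the trap (the authCommunity log public line) but nothing reaches the table, the problem is on the MySQL side (the credentials in the [snmptrapd] section of my.cnf, or the running snmptrapd not being the --with-mysql build); if nothing is logged at all, the trap is not reaching snmptrapd in the first place.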

userparameters and ZBX_NOTSUPPORTED

I want to ping an external IP from all of my servers that run the Zabbix agent.
I searched and found some articles about Zabbix user parameters.
In /etc/zabbix/zabbix_agentd.conf.d/ I created a file named userparameter_ping.conf with the following content:
UserParameter=checkip[*],ping -c4 8.8.8.8 && echo 0 || echo 1
I created an item named checkip on the Zabbix server, with a graph, but got no data. After some more digging I found zabbix_get and tested my userparameter, but I got the error ZBX_NOTSUPPORTED:
# zabbix_get -s 172.20.4.43 -p 10050 -k checkip
My Zabbix version:
Zabbix Agent (daemon) v2.4.5 (revision 53282) (21 April 2015)
Does anybody know what I can do to address this?
After some changes and talks with folks on the mailing list, it finally worked. Here is how:
First I created a file in:
/etc/zabbix/zabbix_agentd.conf.d/
and added this line:
UserParameter=checkip[*],ping -W1 -c2 $1 >/dev/null 2>&1 && echo 0 || echo 1
Then I ran this command:
./sbin/zabbix_agentd -t checkip["8.8.8.8"]
checkip[8.8.8.8] [t|0]
So everything works, but the Timeout option is very important for us:
Add a timeout in /etc/zabbix/zabbix_agentd.conf
Timeout=30
The Timeout default is 3s, so if you run
time ping -W1 -c2 8.8.8.8
you may see that it takes more than 3s, in which case you get the error:
ZBX_NOTSUPPORTED
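With the higher Timeout in place and the agent restarted, the original zabbix_get test should also work once the key is given its parameter (a sketch; expected to print 0 when the ping succeeds):
# zabbix_get -s 172.20.4.43 -p 10050 -k 'checkip[8.8.8.8]'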
It can be anything: for example a timeout (the default timeout is 3 seconds and ping -c4 requires at least 3 seconds), the permission/path to ping, an agent that was not restarted, ...
Increase the debug level, restart the agent and check the Zabbix logs. You can also test zabbix_agentd directly:
zabbix_agentd -t checkip[]
[m|ZBX_NOTSUPPORTED] [Timeout while executing a shell script.] => Timeout problem. Edit zabbix_agentd.conf and increase the Timeout setting. The default 3 seconds is not the best for your ping, which needs 3+ seconds.
If you need more than 30s for the execution, you can use the nohup (command..) & combo to work around the timeout restriction.
That way, if you generate a file with the results, on the next pass you can read the file and get the results back without any need to wait at all.
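A minimal sketch of that pattern (the helper script path, result file and item key are hypothetical, not from the original answer):
#!/bin/sh
# /usr/local/bin/checkip_async.sh -- returns the previous result immediately,
# then starts the next ping in the background so the agent never waits.
TARGET="$1"
RESULT_FILE="/tmp/checkip_${TARGET}.last"
# Print whatever the last run left behind (default to 1 = failure).
cat "$RESULT_FILE" 2>/dev/null || echo 1
# Kick off the next check detached from the agent process.
nohup sh -c "ping -W1 -c2 $TARGET >/dev/null 2>&1; echo \$? > $RESULT_FILE" >/dev/null 2>&1 &
The user parameter then just calls the script and reads back the cached value: UserParameter=checkip.cached[*],/usr/local/bin/checkip_async.sh $1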
For those who may be experiencing other issues with the same error message:
It is important to run zabbix_agentd with the -c parameter:
./sbin/zabbix_agentd -c zabbix_agentd.conf --test checkip["8.8.8.8"]
Otherwise zabbix might not pick up on the command and will thus yield ZBX_NOTSUPPORTED.
It also helps to isolate the command into a script file, as Zabbix will butcher in-line commands in UserParameter= much more than you'd expect.
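For example, a sketch of moving the ping into a script (the path /etc/zabbix/scripts/checkip.sh is an assumption):
#!/bin/sh
# /etc/zabbix/scripts/checkip.sh -- prints 0 if the target answers, 1 otherwise.
ping -W1 -c2 "$1" >/dev/null 2>&1 && echo 0 || echo 1
with the user parameter reduced to UserParameter=checkip[*],/etc/zabbix/scripts/checkip.sh $1, followed by an agent restart.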
I defined two user parameters like this for sync checking between two Samba DCs.
/etc/zabbix/zabbix_agentd.d/userparameter_samba.conf:
UserParameter=syncma, sudo samba-tool drs replicate smb1 smb2 cn=schema,cn=configuration,dc=domain,dc=com
UserParameter=syncam, sudo samba-tool drs replicate smb2 smb1 cn=schema,cn=configuration,dc=domain,dc=com
and also provided sudoer access for Zabbix user to execute the command. /etc/sudoers.d/zabbix:
Defaults:zabbix !syslog
Defaults:zabbix !requiretty
zabbix ALL=(ALL) NOPASSWD: /usr/bin/samba-tool
zabbix ALL=(ALL) NOPASSWD: /usr/bin/systemctl
And "EnableRemoteCommands" is enabled on my zabbix_aganetd.conf, sometimes when I run
zabbix_get -s CLIENT_IP -p10050 -k syncma or
zabbix_get -s CLIENT_IP -p10050 -k syncam
I get the error ZBX_NOTSUPPORTED: Timeout while executing a shell script.
but after executing /sbin/zabbix_agentd -t syncam on the client, the Zabbix server just responds normally:
Replicate from smb2 to smb1 was successful.
and when it has a problem I get the error below in my zabbix.log:
failed to kill [ sudo samba-tool drs replicate smb1 smb2 cn=schema,cn=configuration,dc=domain,dc=com]: [1] Operation not permitted
It seems like a permission error, but it just resolves after executing /sbin/zabbix_agentd -t syncam. I am not sure whether the error is gone permanently or will happen again at the next Zabbix item check interval.

MySQL Cluster - [ndbd] ERROR -- Couldn't start as daemon, error: 'Failed to open logfile'

Recently I wanted to set up a MySQL cluster: one Mgmt node, one SQL node and two data nodes.
It seems to have installed successfully and the Mgmt node started, but when I try to start the data nodes I hit a problem...
Here is the error message when I try to start a data node:
Does anyone know what's going wrong?
Basically I followed the step-by-step tutorial on this site and this site.
It would be very much appreciated if you could give me some advice!
Thanks
Okay, I came up with a solution to fix this issue: 2013-01-18 09:26:10 [ndbd] ERROR -- Couldn't start as daemon, error: 'Failed to open logfile'
I was stuck with the same issue, and after exploring I opened $MY_CLUSTER_INSTALLATION/ndb_data/ndb_1_cluster.log
1. I found the following message in the log:
2013-01-18 09:24:50 [MgmtSrvr] INFO -- Got initial configuration
from 'conf/config.ini',
will try to set it when all ndb_mgmd(s) started
2013-01-18 09:24:50 [MgmtSrvr] INFO -- Node 1: Node 1 Connected
2013-01-18 09:24:54 [MgmtSrvr] ERROR -- Unable to bind management
service port: *:1186!
Please check if the port is already used,
(perhaps a ndb_mgmd is already running),
and if you are executing on the correct computer
2013-01-18 09:24:54 [MgmtSrvr] ERROR -- Failed to start mangement service!
2. I checked the services running on the port on my Mac machine using the following command:
lsof -i :1186
And sure enough, I found the ndb_mgmd(s):
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ndb_mgmd 418 8u IPv4 0x33a882b4d23b342d 0t0 TCP *:mysql-cluster (LISTEN)
ndb_mgmd 418 9u IPv4 0x33a882b4d147fe85 0t0 TCP localhost:50218->localhost:mysql-cluster (ESTABLISHED)
ndb_mgmd 418 10u IPv4 0x33a882b4d26901a5 0t0 TCP localhost:mysql-cluster->localhost:50218 (ESTABLISHED)
3. To kill the processes on the specific port (for me: 1186), I ran the following command:
lsof -P | grep '1186' | awk '{print $2}' | xargs kill -9
4. I repeated the steps listed in the MySQL Cluster installation PDF:
$PATH/mysqlc/bin/ndb_mgmd -f conf/config.ini --initial --configdir=/$PATH/my_cluster/conf/
$PATH/mysqlc/bin/ndbd -c localhost:1186
Hope this helps!
Hope this will be useful.
In my case, two data nodes were connected already.
You can check this in your management node:
[root@ab0]# ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
What I did was:
ndb_mgm> shutdown
and then executed the restart command. It works for me.
Check that the datadir exists and is writeable with "ls -ld /home/netdb/mysql_cluster/data" on datanode1.
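If it is missing or not writable, a minimal sketch of fixing that (the path comes from the check above; the user running ndbd is an assumption, adjust as needed):
mkdir -p /home/netdb/mysql_cluster/data
chown -R netdb /home/netdb/mysql_cluster/data
ls -ld /home/netdb/mysql_cluster/data    # re-check that it exists and is writable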