Can't install cloudwatch agent by cloudformation on Amazon ECS-optimized AMI - json

I am creating a cloudformation template, which creates some resources as EC2 instance, autoscaling group and launchConfiguration.
By the userData property of the launchConfiguration resource, I tried to install the Cloudwatch agent as follows:
"UserData":{ "Fn::Base64" : {
"Fn::Join" : ["", [
"#!/bin/bash -xe\n",
"yum -y install aws-cfn-bootstrap\n",
"/opt/aws/bin/cfn-init -v",
" --stack ", { "Ref": "AWS::StackName" },
" --resource LaunchCongig",
" --region ", { "Ref" : "AWS::Region" },"\n",
"yum -y install wget\n",
"# Get the CloudWatch Logs agent\n",
"wget https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py\n",
"# Install the CloudWatch Logs agent\n",
"python ./awslogs-agent-setup.py -n -r ", { "Ref" : "AWS::Region" }, " -c /etc/cwlogs.cfg || error_exit 'Failed to run CloudWatch Logs agent setup'\n",
"service awslogs start"
]]}
After ssh into the instance, I checked the file /var/log/cloud-init-output.log to see if everything is fine, but here is what I got:
+ wget https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py
--2017-02-17 14:36:10-- https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.226.59
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.226.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47998 (47K) [text/x-python]
Saving to: ‘awslogs-agent-setup.py’
0K .......... .......... .......... .......... ...... 100% 196K=0.2s
2017-02-17 14:36:10 (196 KB/s) - ‘awslogs-agent-setup.py’ saved [47998/47998]
+ python ./awslogs-agent-setup.py -n -r eu-west-1 -c /etc/cwlogs.cfg
Step 1 of 5: Installing pip ...Traceback (most recent call last):
File "./awslogs-agent-setup.py", line 1144, in <module>
main()
File "./awslogs-agent-setup.py", line 1140, in main
setup.setup_artifacts()
File "./awslogs-agent-setup.py", line 693, in setup_artifacts
self.install_pip()
File "./awslogs-agent-setup.py", line 600, in install_pip
fail("Could not install pip. Please try again or see " + AGENT_SETUP_LOG_FILE + " for more details")
TypeError: fail() takes exactly 2 arguments (1 given)
+ error_exit 'Failed to run CloudWatch Logs agent setup'
/var/lib/cloud/instance/scripts/part-001: line 8: error_exit: command not found
Feb 17 14:36:12 cloud-init[2798]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [127]
Feb 17 14:36:12 cloud-init[2798]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Feb 17 14:36:12 cloud-init[2798]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 0.7.6 finished at Fri, 17 Feb 2017 14:36:12 +0000. Datasource DataSourceEc2. Up 85.78 seconds
What is wrong with this script? Is there any other way to install the agent?
Thank you.
EDIT:
I figured out that is because maybe the python-pip package didn't get installed so I added this to the userData:
"yum -y install python-pip\n",
After that I played the template again and strangely I got the same Error.
I am usinh an Amazon ECS-optimized AMI

I solved the problem by installing the agent directly by yum awslogs:
"UserData":{ "Fn::Base64" : {
"Fn::Join" : ["", [
"#!/bin/bash -xe\n",
"yum -y install aws-cfn-bootstrap\n",
"/opt/aws/bin/cfn-init -v",
" --stack ", { "Ref": "AWS::StackName" },
" --resource launchConfig",
" --region ", { "Ref" : "AWS::Region" },"\n",
"yum -y install awslogs\n",
"service awslogs start"
]]}
Here is the output from the log file:
Installed:
awslogs.noarch 0:1.1.2-1.10.amzn1
Dependency Installed:
aws-cli.noarch 0:1.11.29-1.45.amzn1
aws-cli-plugin-cloudwatch-logs.noarch 0:1.3.3-1.15.amzn1
freetype.x86_64 0:2.3.11-15.14.amzn1
libjpeg-turbo.x86_64 0:1.2.90-5.14.amzn1
mailcap.noarch 0:2.1.31-2.7.amzn1
python27-botocore.noarch 0:1.4.86-1.62.amzn1
python27-colorama.noarch 0:0.2.5-1.7.amzn1
python27-dateutil.noarch 0:2.1-1.3.amzn1
python27-docutils.noarch 0:0.11-1.15.amzn1
python27-futures.noarch 0:3.0.3-1.3.amzn1
python27-imaging.x86_64 0:1.1.6-19.9.amzn1
python27-jmespath.noarch 0:0.9.0-1.11.amzn1
python27-ply.noarch 0:3.4-3.12.amzn1
python27-pyasn1.noarch 0:0.1.7-2.9.amzn1
python27-rsa.noarch 0:3.4.1-1.8.amzn1
Complete!
+ service awslogs start
Starting awslogs: [ OK ]
Cloud-init v. 0.7.6 finished at Fri, 17 Feb 2017 15:33:42 +0000. Datasource DataSourceEc2. Up 83.47 seconds
Everything works fine this way. Hope that will help someone someday!

For ECS specifically, see Using CloudWatch Logs with Container Instances in the EC2 Container Service documentation for details on configuring CloudWatch Logs. The documentation recommends using yum install -y awslogs instead of the Python install script.
The documentation provides a complete sample in the Configuring CloudWatch Logs at Launch with User Data section.
In your case, since you're already managing your config files using cfn-init and CloudFormation::Init metadata in CloudFormation, you don't need any complex parsing of config files in your User-Data script, but you can still use the script as a reference. One thing worth adding to your User-Data script is running chkconfig awslogs on to make sure the service continues running on the instance after a reboot.

Related

CodeBuild succeeds while Node build command has failed

Let's assume that we have AWS CodeBuild project which buildspec file contains the following:
"phases": {
"install": {
"runtime-versions": {
"nodejs": 14
}
},
"build": {
"commands": [
"echo Build started on `date`",
"npm ci && npm audit --audit-level=critical",
"node -r esm index.js --env=CLOUDTEST"
]
},
"post_build": {
"commands": [
"echo Build completed on `date`"
]
}
It runs successfully, but if, for example, env value is wrong, node command will fail, but the build still succeeds.
Question: what should I do to make CodeBuild project to fail when node command fails?
The build phase transitions to post_build by default, regardless whether the build succeeds or fails. To override this behaviour, explicitly set the phase's on-failure behaviour:
"build": {
"on-failure": "ABORT",
"commands": ["node BOOM"]
},
The bad node BOOM command causes the execution to fail immediately. Logs tail:
[Container] 2022/04/29 11:21:10 Command did not exit successfully node BOOM exit status 1
[Container] 2022/04/29 11:21:10 Phase complete: BUILD State: FAILED_WITH_ABORT
[Container] 2022/04/29 11:21:10 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: node BOOM. Reason: exit status 1

Crontab to reload PM2 cluster processes

I'm around this issue for one day with no effective results. Imagine I want to reload (zero downtime, thus not restart) a PM2 cluster process every hour with crontab.
The PM2 app is this one, which I start with sudo pm2 start app.json (btw, I think it's not relevant, but I'm using pm2 as sudo, I can't recall why)
{
"apps" : [{
"name" : "autocosts.prod",
"script" : "bin/server.js",
"cwd" : "/var/www/autocosts.prod/",
"node_args" : "--use_strict",
"args" : "-r prod --print --pdf --social --googleCaptcha --googleAnalytics --database",
"exec_mode" : "cluster",
"instances" : 4,
"wait_ready" : true,
"listen_timeout" : 50000,
"watch" : false,
"exp_backoff_restart_delay" : 200,
"env": {
"NODE_ENV": "production"
},
"log_date_format": "DD-MM-YYYY"
}]
}
My crontab line is
# reload every hour
0 * * * * /usr/local/bin/node /usr/local/bin/pm2 reload /var/www/autocosts.prod/app.json > /var/log/pm2/app.cron.log 2>&1
But on that log file I get an error
[PM2] Applying action reloadProcessId on app [autocosts.prod](ids: [ 0, 1, 2, 3 ])
[PM2][ERROR] Process 0 not found
Process 0 not found
It seems PM2 can't detect the id numbers of the cluster.
How to have crontabs to reload (zero downtime) a PM2 cluster process?
I found the problem, was indeed due to the fact that I was using pm2 as sudo and crontab in sudo mode has no access to a series of environments.
Change owner of PM2 to yourself (do which pm2 to find the full path of the executable), such that you can use it without sudo. Do the same for ~/.pm2.

Why can I install this Jelastic manifest through the dashboard import function but not throuhg the Jelastic API?

I have the following very simple manifest:
type: install
name: very simple manifest
onInstall:
- log: installing manifest
I can install it from the Jelastic Dashboard. There is an import function in the main menu where I can copy / paste that manifest content and it gets installed. In the Jelastic console, I can see
[15:36:38 manifest.settings]: BEGIN INSTALLATION: very simple manifest
[15:36:39 manifest.settings]: BEGIN HANDLE EVENT: {"topic":"application/install","envAppid":""}
[15:36:39 manifest.settings:1]:> installing manifest
[15:36:39 manifest.settings]: END HANDLE EVENT: application/install
[15:36:39 manifest.settings]: END INSTALLATION: very simple manifest
and the Jelastic dashboard confirms installation.
Now, when I do the same, but via the Jelastic REST API, i.e. using the endpoint
http://my-jelastic-provide.com/1.0/marketplace/jps/REST/install
with the relevant data, then, it doesn't install. Instead, I get the strange error message
Can\'t find environment by domain [jelasticclient-master-0954606]
where jelasticclient-master-0954606 is the envName I set.
However, if I change my manifest to e.g.
type: install
name: very simple manifest
nodes:
count: 1
cloudlets: 4
nodeGroup: cp
image: alpine:latest
skipNodeEmails: true
onInstall:
- log: installing manifest
then it installs perfectly. What am I missing?
I am using Jelastic v6.0.2.
Your "very simple manifest" doesn't suppose any environment name to be passed.
That's why when you pass it you get an error "Can't find environment by domain [domain-name]" (Example1).
When you don't have the "nodes" parameter in the manifest (as in your second example), you shouldn't pass any environment name (Example2) or should pass the existing environment name (response is in Example3).
Example1:
curl -X POST 'https://jca.host-domain/1.0/marketplace/jps/rest/install' \
-d 'envName=jelasticclient-master-0954606' \
-d session=*** \
-d skipNodeEmails=1 \
-d ownerUid=UID \
--data-urlencode 'jps={ "type": "install", "name": "very simple manifest", "onInstall": [ { "log": "installing manifest" } ] }'
The response is:
{"result":11,"response":{"result":11,"source":"JEL","error":"domain [jelasticclient-master-0954606] doesn't exist"},"source":"JEL","error":"domain [jelasticclient-master-0954606] doesn't exist"}
When the environment name is not passed (Example2),
curl -X POST 'https://jca.host-domain/1.0/marketplace/jps/rest/install' \
-d session=*** \
-d skipNodeEmails=1 \
-d ownerUid=UID \
--data-urlencode 'jps={ "type": "install", "name": "very simple manifest", "onInstall": [ { "log": "installing manifest" } ] }'
the response is
{"result":0,"uniqueName":"3c819586-2ef7-4691-9faa-d3059459d20e","response":{"result":0,"uniqueName":"3c819586-2ef7-4691-9faa-d3059459d20e","successText":"","appid":""},"appid":"","successText":""}
When the environment with envName=jelasticclient-master-0954606 already exists, the response of the same request from the Example1 is as this (Example3)
{"result":0,"uniqueName":"b52a8db9-8850-4b66-958a-3dee3345b923","response":{"result":0,"uniqueName":"b52a8db9-8850-4b66-958a-3dee3345b923","successText":"","appid":"7b0c465f6c9573b8d8ce3ed59591781b"},"appid":"7b0c465f6c9573b8d8ce3ed59591781b","successText":""}
In other words, if you pass the environment name when deploying this "very simple manifest" this manifest is installed like an add-on because there is no "nodes" parameter in it but there is no existing environment "jelasticclient-master-0954606" to install this "add-on".

How to manually recreate the bootstrap client certificate for OpenShift 3.11 master?

Our origin-node.service on the master node fails with:
root#master> systemctl start origin-node.service
Job for origin-node.service failed because the control process exited with error code. See "systemctl status origin-node.service" and "journalctl -xe" for details.
root#master> systemctl status origin-node.service -l
[...]
May 05 07:17:47 master origin-node[44066]: bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2020-02-20 13:14:27 +0000 UTC
May 05 07:17:47 master origin-node[44066]: bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
May 05 07:17:47 master origin-node[44066]: certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
May 05 07:17:47 master origin-node[44066]: server.go:262] failed to run Kubelet: cannot create certificate signing request: Post https://lb.openshift-cluster.mydomain.com:8443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: EOF
So it seems that kubelet-client-current.pem and/or kubelet-server-current.pem contains an expired certificate and the service tries to create a CSR using an endpoint which is probably not yet available (because the master is down). We tried redeploying the certificates according to the OpenShift documentation Redeploying Certificates, but this fails while detecting an expired certificate:
root#master> ansible-playbook -i /etc/ansible/hosts openshift-master/redeploy-openshift-ca.yml
[...]
TASK [openshift_certificate_expiry : Fail when certs are near or already expired] *******************************************************************************************************************************************
fatal: [master.openshift-cluster.mydomain.com]: FAILED! => {"changed": false, "msg": "Cluster certificates found to be expired or within 60 days of expiring. You may view the report at /root/cert-expiry-report.20200505T042754.html or /root/cert-expiry-report.20200505T042754.json.\n"}
[...]
root#master> cat /root/cert-expiry-report.20200505T042754.json
[...]
"kubeconfigs": [
{
"cert_cn": "O:system:cluster-admins, CN:system:admin",
"days_remaining": -75,
"expiry": "2020-02-20 13:14:27",
"health": "expired",
"issuer": "CN=openshift-signer#1519045219 ",
"path": "/etc/origin/node/node.kubeconfig",
"serial": 27,
"serial_hex": "0x1b"
},
{
"cert_cn": "O:system:cluster-admins, CN:system:admin",
"days_remaining": -75,
"expiry": "2020-02-20 13:14:27",
"health": "expired",
"issuer": "CN=openshift-signer#1519045219 ",
"path": "/etc/origin/node/node.kubeconfig",
"serial": 27,
"serial_hex": "0x1b"
},
[...]
"summary": {
"expired": 2,
"ok": 22,
"total": 24,
"warning": 0
}
}
There is a guide for OpenShift 4.4 for Recovering from expired control plane certificates, but that does not apply for 3.11 and we did not find such a guide for our version.
Is it possible to recreate the expired certificates without a running master node for 3.11? Thanks for any help.
OpenShift Ansible: https://github.com/openshift/openshift-ansible/releases/tag/openshift-ansible-3.11.153-2
Update 2020-05-06: I also executed redeploy-certificates.yml, but it fails at the same TASK:
root#master> ansible-playbook -i /etc/ansible/hosts playbooks/redeploy-certificates.yml
[...]
TASK [openshift_certificate_expiry : Fail when certs are near or already expired] ******************************************************************************
Wednesday 06 May 2020 04:07:06 -0400 (0:00:00.909) 0:01:07.582 *********
fatal: [master.openshift-cluster.mydomain.com]: FAILED! => {"changed": false, "msg": "Cluster certificates found to be expired or within 60 days of expiring. You may view the report at /root/cert-expiry-report.20200506T040603.html or /root/cert-expiry-report.20200506T040603.json.\n"}
Update 2020-05-11: Running with -e openshift_certificate_expiry_fail_on_warn=False results in:
root#master> ansible-playbook -i /etc/ansible/hosts -e openshift_certificate_expiry_fail_on_warn=False playbooks/redeploy-certificates.yml
[...]
TASK [Wait for master API to come back online] *****************************************************************************************************************
Monday 11 May 2020 03:48:56 -0400 (0:00:00.111) 0:02:25.186 ************
skipping: [master.openshift-cluster.mydomain.com]
TASK [openshift_control_plane : restart master] ****************************************************************************************************************
Monday 11 May 2020 03:48:56 -0400 (0:00:00.257) 0:02:25.444 ************
changed: [master.openshift-cluster.mydomain.com] => (item=api)
changed: [master.openshift-cluster.mydomain.com] => (item=controllers)
RUNNING HANDLER [openshift_control_plane : verify API server] **************************************************************************************************
Monday 11 May 2020 03:48:57 -0400 (0:00:00.945) 0:02:26.389 ************
FAILED - RETRYING: verify API server (120 retries left).
FAILED - RETRYING: verify API server (119 retries left).
[...]
FAILED - RETRYING: verify API server (1 retries left).
fatal: [master.openshift-cluster.mydomain.com]: FAILED! => {"attempts": 120, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://lb.openshift-cluster.mydomain.com:8443/healthz/ready"], "delta": "0:00:00.182367", "end": "2020-05-11 03:51:52.245644", "msg": "non-zero return code", "rc": 35, "start": "2020-05-11 03:51:52.063277", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
root#master> systemctl status origin-node.service -l
[...]
May 11 04:23:28 master.openshift-cluster.mydomain.com origin-node[109972]: E0511 04:23:28.077964 109972 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2020-02-20 13:14:27 +0000 UTC
May 11 04:23:28 master.openshift-cluster.mydomain.com origin-node[109972]: I0511 04:23:28.078001 109972 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
May 11 04:23:28 master.openshift-cluster.mydomain.com origin-node[109972]: I0511 04:23:28.080555 109972 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
May 11 04:23:28 master.openshift-cluster.mydomain.com origin-node[109972]: F0511 04:23:28.130968 109972 server.go:262] failed to run Kubelet: cannot create certificate signing request: Post https://lb.openshift-cluster.mydomain.com:8443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: EOF
[...]
I have this same case in customer environment, this error is because the certified was expiry, i "cheated" changing da S.O date before the expiry date. And the origin-node service started in my masters:
systemctl status origin-node
● origin-node.service - OpenShift Node
Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
Active: active (running) since Sáb 2021-02-20 20:22:21 -02; 6min ago
Docs: https://github.com/openshift/origin
Main PID: 37230 (hyperkube)
Memory: 79.0M
CGroup: /system.slice/origin-node.service
└─37230 /usr/bin/hyperkube kubelet --v=2 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=5m --authorization-mode=Webhook --authorization-webhook-c...
Você tem mensagem de correio em /var/spool/mail/okd
The openshift_certificate_expiry role uses the openshift_certificate_expiry_fail_on_warn variable to determine if the playbook should fail when the days left are less than openshift_certificate_expiry_warning_days.
So try running the redeploy-certificates.yml with this additional variable set to "False":
ansible-playbook -i /etc/ansible/hosts -e openshift_certificate_expiry_fail_on_warn=False playbooks/redeploy-certificates.yml

Packer Script exited with non-zero exit status : 127

I am trying to provision an AWS AMI but the packer script terminates with the following error.
Build 'amazon-ebs' errored: Script exited with non-zero exit status: 127
==> Some builds didn't complete successfully and had errors:
--> amazon-ebs: Script exited with non-zero exit status: 127
==> Builds finished but no artifacts were created.
My template for Packer is as follows:
{
"variables": {
"aws_access_key": "{{env `MY_ACCESS_KEY`}}",
"aws_secret_key": "{{env `MY_SECRET_KEY`}}"
},
"builders": [{
"type": "amazon-ebs",
"access_key": "{{user `aws_access_key`}}",
"secret_key": "{{user `aws_secret_key`}}",
"region": "us-east-1",
"source_ami":"ami-8c1be5f6",
"instance_type": "t2.micro",
"ssh_username": "ec2-user",
"ami_name": "packer-example {{timestamp}}"
}],
"provisioners":[
{
"type": "shell",
"script": "provision.sh"
}]}
Error log is as follows: PACKER_LOG=1 packer build template.json
2017/11/01 01:30:37 [INFO] (telemetry) ending amazon-ebs
2017/11/01 01:30:37 [INFO] (telemetry) found error: Script exited with non-zero exit status: 127
2017/11/01 01:30:37 ui error: Build 'amazon-ebs' errored: Script exited with non-zero exit status: 127
2017/11/01 01:30:37 Builds completed. Waiting on interrupt barrier...
2017/11/01 01:30:37 machine readable: error-count []string{"1"}
2017/11/01 01:30:37 ui error:
==> Some builds didn't complete successfully and had errors:
2017/11/01 01:30:37 machine readable: amazon-ebs,error []string{"Script exited with non-zero exit status: 127"}
2017/11/01 01:30:37 ui error: --> amazon-ebs: Script exited with non-zero exit status: 127
2017/11/01 01:30:37 ui:
==> Builds finished but no artifacts were created.
Build 'amazon-ebs' errored: Script exited with non-zero exit status: 127
provision.sh comprises of
#!/bin/bash
sudo yum install httpd -y
sudo yum update -y
sudo aws s3 cp s3://zbcxlkxcjvlxkj/index1.html /var/www/html/ --region us-east-1
sudo service httpd start
sudo chkconfig httpd on
Exit code 127 is returned by bash when a command is not found. Most likely you do not have all of the commands used in your script installed on your image prior to running it.