gce_instance monitoring and logging Disk Usage in bytes works for a few minutes then breaks - google-compute-engine

I have followed the tutorials and successfully installed the monitoring and logging agents on my Debian 9 machine. All statuses are OK.
In Metrics Explorer, the gce_instance Disk Usage in bytes metric works for a few minutes and then breaks. I get the following error on my machine:
Aug 04 15:43:23 master collectd[13129]: write_gcm: Unsuccessful HTTP request 400: {
  "error": {
    "code": 400,
    "message": "Field timeSeries[2].points[0].interval.start_time had an invalid value of \"2020-08-04T07:43:22.681979-07:00\": The start time must be before the end time (2020-08-04T07:43:22.681979-07:00) for the non-gauge metric 'agent.googleapis.com/agent/api_request_count'.",
    "status": "INVALID_ARGUMENT"
  }
}
Aug 04 15:43:23 master collectd[13129]: write_gcm: Error talking to the endpoint.
Aug 04 15:43:23 master collectd[13129]: write_gcm: wg_transmit_unique_segment failed.
Aug 04 15:43:23 master collectd[13129]: write_gcm: wg_transmit_unique_segments failed. Flushing.
EDIT:
For anyone experiencing these issues: it is now a confirmed bug. I have filed a support ticket in the Google Issue Tracker.

These error messages are harmless; you are not losing metrics, so you can ignore them without any problem.
The root cause is a server-side configuration change that affects all agents. The change only affected the verbosity of the responses, not the processing of the requests: some of the incoming metrics were silently dropped before the change, and are now dropped noisily.
There is an Issue Tracker entry where you can see more details about the issue that is affecting you.
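If you want to confirm these are the only errors the agent is hitting, you can follow its collectd log stream; a minimal sketch, assuming collectd logs through syslog/journald as in the output above:
# Follow the agent's collectd messages and keep only the write_gcm lines
journalctl -t collectd -f | grep write_gcm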

Related

SendGrid misconfiguration on Google Cloud (535 Authentication failed)

So I've installed SendGrid on Google Compute Engine with a CentOS base, following the documented instructions from Google:
https://cloud.google.com/compute/docs/tutorials/sending-mail/using-sendgrid#before-you-begin
Using the test from the command line (various accounts):
echo 'MESSAGE' | mail -s 'SUBJECT' GJ******@gmail.com
the /var/log/maillog shows several lines like this, with 50 or so attempts in one second:
postfix/error[32324]: A293210062D7: to=<GJ********@gmail.com>, relay=none, delay=145998, delays=145997/1.2/0/0, dsn=4.0.0, status=deferred (delivery temporarily suspended: SASL authentication failed; server smtp.sendgrid.net[167.89.115.53] said: 535 Authentication failed: The provided authorization grant is invalid, expired, or revoked)
And the message is queued up and retried every few hours. Now, messing around, I could change the port setting from 2525 to one of the regular ports that isn't blocked by Google, and then the email gets bounced right away to the user account in the mail test message.
I made sure to use the generated API key; the SendGrid system says no attempts have been made, bounced, or anything else.
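(For reference, the tutorial's relay setup boils down to something like the sketch below; the values shown, and YOUR_API_KEY, are placeholders rather than my actual settings.)
# /etc/postfix/main.cf -- relay outbound mail through SendGrid on port 2525
relayhost = [smtp.sendgrid.net]:2525
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt
# /etc/postfix/sasl_passwd -- the username is literally "apikey"; YOUR_API_KEY is a placeholder:
# [smtp.sendgrid.net]:2525 apikey:YOUR_API_KEY
# then: sudo postmap /etc/postfix/sasl_passwd && sudo systemctl restart postfix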
There were other errors in the maillog as well; since it retries every second, there were pages of them. I changed the permissions on that directory so they no longer appear, but maybe they give a clue to how it's misconfigured?
Oct 31 19:04:14 beadc postfix/pickup[15119]: fatal: chdir("/var/spool/postfix"): Permission denied
Oct 31 19:04:15 beadc postfix/master[1264]: warning: process /usr/libexec/postfix/qmgr pid 15118 exit status 1
Oct 31 19:04:15 beadc postfix/master[1264]: warning: /usr/libexec/postfix/qmgr: bad command startup -- throttling
Oct 31 19:04:15 beadc postfix/master[1264]: warning: process /usr/libexec/postfix/pickup pid 15119 exit status 1
Oct 31 19:04:15 beadc postfix/master[1264]: warning: /usr/libexec/postfix/pickup: bad command startup -- throttling
The only info I can find searching about the error is that it means a SendGrid misconfiguration.
Any ideas as to what the misconfiguration might be?
I've determined the 535 error was a port/firewall issue, which means that the 550 error I had on the other port still exists.
Check your firewall settings regarding the 535 error:
https://cloud.google.com/compute/docs/tutorials/sending-mail/
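A quick way to check whether an SMTP port is actually reachable from the instance is a zero-I/O netcat probe; substitute whichever port you are relaying through:
# Succeeds only if outbound traffic to SendGrid on port 2525 is allowed
nc -vz smtp.sendgrid.net 2525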

OpenShift 3, 503 Error (No server is available to handle this request)

I have created a web application using JSP/Tiles/Struts/MySQL/Tomcat. I created a new project on the OpenShift 3 console (OpenShift Online), https://console.preview.openshift.com/console/, then added Tomcat/MySQL. I was getting a 503 error sometimes; at other times the same page worked as expected. The 503 error came randomly for any page in my project. When I get a 503 error, I refresh some number of times and it goes away, and my page is correctly displayed.
Error that I see is:
"503 Service Unavailable
No server is available to handle this request. "
I did some research. What I understand from this OpenShift 2 link:
https://blog.openshift.com/how-to-host-your-java-ee-application-with-auto-scaling/
is that to correct the 503 error, you should:
SSH into your application gear using rhc ssh --app <app_name>
Change directory to haproxy/conf
Change the following in haproxy.cfg: option httpchk GET / to option httpchk GET /api/v1/ping
Restart the HAProxy cartridge from your local machine using rhc cartridge-restart --cartridge haproxy
I don't know if this is also applicable to OpenShift 3. In OpenShift 3, where are haproxy.log, haproxy.cfg, and haproxy/conf, or is it slightly different? (But thanks to Warren's comments: yes, he saw the 503 error in OpenShift related to HAProxy.)
Now, one week after posting this question:
I am getting a Quota Reached error. I am able to build my project, but all deployments are failing. I wonder if the 503 error I was getting earlier was (completely or partially) related to the quota being reached. How should I proceed now?
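To see which quota is being hit, the project's resource quotas can be inspected from the CLI; a sketch, assuming oc is logged in to the affected project:
# List and then describe the resource quotas for the current project
oc get quota
oc describe quota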
curl -i localhost:8080/GEA
HTTP/1.1 302 Found
Server: Apache-Coyote/1.1
Location: http://localhost:8080/GEA/
Transfer-Encoding: chunked
Date: Tue, 11 Apr 2017 18:03:25 GMT
Tomcat logs do not show any application error.
Will readiness and liveness probes help me? I have not set them yet, nor do I know how to set them (see the sketch after these questions).
Will scaling help me? (I don't know how to set that either.)
Do I have to set memory and the like to the maximum allowed to ensure the project runs smoothly?
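For what it's worth, probes can be added without writing YAML; a minimal sketch, assuming an OpenShift 3 deployment config named gea (a hypothetical name) serving on port 8080:
# Readiness probe: is the pod ready to receive traffic?
oc set probe dc/gea --readiness --get-url=http://:8080/GEA/ --initial-delay-seconds=15
# Liveness probe: should the pod be restarted?
oc set probe dc/gea --liveness --get-url=http://:8080/GEA/ --initial-delay-seconds=30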
I had a similar situation: getting 503s sometimes and my actual page at other times. The reason is that HAProxy sits on the frontend handling the requests. Depending on your setup you may even have a few HAProxy pods, and your request could be funneled to any one of them. So, as in my case, one pod was working and the other was not.
So basically
oc get pods -n default
NAME                        READY     STATUS    RESTARTS   AGE
docker-registry-7-i02rh     1/1       Running   0          75d
registry-console-12-wciib   1/1       Running   0          67d
router-1-533cg              1/1       Running   3          76d
router-1-9utld              1/1       Running   1          76d
router-1-uwf64              1/1       Running   1          76d
As you can see in my output, the default namespace is where my router (HAProxy) pods live. If I change to that namespace:
oc project default
Then run:
oc logs -f router-1-533cg
on each of the pods, and you will most likely find a specific pod that is misbehaving. You can simply delete it, and the replication controller will create a new one.
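For example, deleting the misbehaving router pod from the listing above; the replication controller recreates it automatically:
# Delete the bad pod; its replication controller spins up a replacement
oc delete pod router-1-533cg -n default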

Stackdriver Monitoring with full access scope not authorized

After deploying a brand new Google Compute Engine instance with full API access and installing the Stackdriver agent, the Monitoring is not showing any metrics from the agent.
According to the Install Agent manual, no further settings (like manually configuring an API key) should be required.
The agent service status also shows the following error:
$ systemctl status stackdriver-agent
Jul 13 10:14:00 host stackdriver-agent[21203]: [ OK ]
Jul 13 10:14:00 host systemd[1]: Started LSB: start and stop Stackdriver Agent.
Jul 13 10:14:00 host collectd[21226]: Initialization complete, entering read-loop.
Jul 13 10:14:00 host collectd[21226]: match_throttle_metadata_keys: 1 history entries, 1 distinct keys, 46 bytes server memory.
Jul 13 10:14:00 host collectd[21226]: tcpconns plugin: Reading from netlink succeeded. Will use the netlink method from now on.
Jul 13 10:14:00 host collectd[21226]: write_gcm: Asking metadata server for auth token
Jul 13 10:14:01 host collectd[21226]: write_gcm: Unsuccessful HTTP request 403: {
"error": {
"code": 403,...
Jul 13 10:14:01 host collectd[21226]: write_gcm: Error talking to the endpoint.
Jul 13 10:14:01 host collectd[21226]: write_gcm: wg_transmit_unique_segment failed.
Jul 13 10:14:01 host collectd[21226]: write_gcm: wg_transmit_unique_segments failed. Flushing.
Google Cloud Console shows the instance having:
Cloud API access scopes
This instance has full API access to all Google Cloud services.
and running the following command inside the instance shows:
$ curl --silent -f -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/scopes
https://www.googleapis.com/auth/cloud-platform
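You can also check that the metadata server will actually hand out an access token, which is what write_gcm is asking for in the log above:
# Should return a JSON document containing an access_token field
curl --silent -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token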
Any thoughts on what is going wrong?
I figured it out:
You have to enable the Google Monitoring API in the API Manager; it is not enabled by default. There is no need to specify an API key: the default application credentials are picked up.
Interestingly, I have two projects that have also been using Stackdriver Monitoring since early this year, and those do not require the Google Monitoring API to be enabled.
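If you prefer the command line, current gcloud versions can enable the API as well; a sketch, assuming the Cloud SDK is installed and pointed at the affected project:
# Enable the Monitoring API for the current project
gcloud services enable monitoring.googleapis.com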

Google Cloud Logging + google-fluentd Dropping Messages

I have a rather small (1-2 node) Kubernetes cluster running in GKE with ±40 pods. The problem at hand is that it's not logging to the GCE console properly. I see lots of messages from the fluentd container(s) in the following format:
$ kubectl logs fluentd-cloud-logging-gke-xxxxxxxx-node-xxxx
2016-02-02 23:30:09 +0000 [warn]: Dropping 10 log message(s) error_class="Google::APIClient::ClientError" error="Project has not enabled the API. Please use Google Developers Console to activate the 'logging' API for your project."
2016-02-02 23:30:09 +0000 [warn]: Dropping 1 log message(s) error_class="Google::APIClient::ClientError" error="Project has not enabled the API. Please use Google Developers Console to activate the 'logging' API for your project."
2016-02-02 23:30:09 +0000 [warn]: Dropping 3 log message(s) error_class="Google::APIClient::ClientError" error="Project has not enabled the API. Please use Google Developers Console to activate the 'logging' API for your project."
2016-02-02 23:30:09 +0000 [warn]: Dropping 41 log message(s) error_class="Google::APIClient::ClientError" error="Project has not enabled the API. Please use Google Developers Console to activate the 'logging' API for your project."
2016-02-02 23:30:09 +0000 [warn]: Dropping 5 log message(s) error_class="Google::APIClient::ClientError" error="Project has not enabled the API. Please use Google Developers Console to activate the 'logging' API for your project."
...and so on. I'm seeing ~5 of these messages per second, so I know things are producing logs. However, in the Compute Engine console I see only a fraction of them (screenshot omitted).
So somewhere in between I'm obviously losing lots of messages. Strange, though, that I'm not losing all of them!
The cluster is configured with Logging.write and Monitoring.all privileges as suggested in GH issue #15727
It's definitely confusing that some of the logs are showing up. Given that error message, I'd expect none of your logs to be showing up in the viewer, since it sounds like the logging API hasn't been enabled for your project yet.
You can enable it from the Developers Console. Try going there, clicking the Enable API button, and seeing whether the errors keep coming.
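Alternatively, a sketch using the gcloud CLI (assuming the Cloud SDK is installed and set to the affected project):
# Enable the Logging API for the current project
gcloud services enable logging.googleapis.com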

Google Compute Engine VM instance error in google.startup.script

Upon rebooting the Google Compute Engine VM instance, I see these errors:
startupscript: Finished running startup script /var/run/google.startup.script
xxxx accounts-from-metadata: WARNING error while trying to update accounts: <urlopen error [Errno 101] Network is unreachable>
xxxx accounts-from-metadata: WARNING error while trying to update accounts: <urlopen error [Errno 101] Network is unreachable>
What could be the problem?
Update: Upon viewing the original question and reformatting it, it looks like there's a network error at boot (it was hidden because the text in <...> was treated as HTML and not displayed), so my earlier answer (below) may not be applicable. Leaving it here for future reference.
Please check your network settings, firewalls, etc. in the meantime.
Original text:
You may have a syntax error in the sshKeys metadata key. The format is:
<username>:<protocol> <key-blob> <username@example.com>
The right-hand side of the : is essentially the contents of your public key, e.g., ~/.ssh/google_compute_engine.pub.
To see your current metadata key:
SSH into the instance, e.g., via gcloud compute ssh, or via the SSH button in the Developers Console
Load this key via:
curl http://metadata/computeMetadata/v1/project/attributes/sshKeys \
-H "Metadata-Flavor: Google"
and check the formatting.
You can then change the metadata on your instance.
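For project-wide keys, a sketch of a well-formed entry and the update step; alice, the key material, and keys.txt are hypothetical placeholders:
# keys.txt -- one entry per line, in the format shown above, e.g.:
# alice:ssh-rsa AAAAB3NzaC1yc2E...rest-of-public-key alice@example.com
gcloud compute project-info add-metadata --metadata-from-file sshKeys=keys.txt
(Instance-level metadata can be changed analogously with gcloud compute instances add-metadata.)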