Stackdriver unable to determine collectd endpoint - google-compute-engine

All my hosts stopped reporting stats to the Stackdriver collectd gateway. This is due to some internal change on Google's side.
In the log files I see this:
Jan 13 08:52:36 ign-rpt01 systemd[1]: Stopping LSB: start and stop Stackdriver Agent...
Jan 13 08:52:36 ign-rpt01 stackdriver-agent[10768]: mesg: ttyname failed: Inappropriate ioctl for device
Jan 13 08:52:36 ign-rpt01 stackdriver-agent[10768]: * Stopping Stackdriver metrics collection agent stackdriver-agent
Jan 13 08:52:37 ign-rpt01 stackdriver-agent[10768]: ...done.
Jan 13 08:52:37 ign-rpt01 systemd[1]: Stopped LSB: start and stop Stackdriver Agent.
Jan 13 08:52:37 ign-rpt01 systemd[1]: Starting LSB: start and stop Stackdriver Agent...
Jan 13 08:52:37 ign-rpt01 stackdriver-agent[10794]: mesg: ttyname failed: Inappropriate ioctl for device
Jan 13 08:52:37 ign-rpt01 stackdriver-agent[10794]: * Starting Stackdriver metrics collection agent stackdriver-agent
Jan 13 08:52:38 ign-rpt01 stackdriver-agent[10794]: Unable to determine collectd endpoint!
Jan 13 08:52:38 ign-rpt01 stackdriver-agent[10794]: * not starting, configuration error
Jan 13 08:52:38 ign-rpt01 stackdriver-agent[10794]: ...fail!
Jan 13 08:52:38 ign-rpt01 systemd[1]: Started LSB: start and stop Stackdriver Agent.
Jan 13 08:53:16 ign-rpt01 extractd[10869]: Error sending processes data: Stackdriver gateway replied with a 401: <html><title>HTTP 401: Unauthorized (Invalid API key)</title><body>HTTP 401: Unauthorized (Invalid API key)</body></html>
Jan 13 08:54:16 ign-rpt01 extractd[10903]: Error sending processes data: Stackdriver gateway replied with a 401: <html><title>HTTP 401: Unauthorized (Invalid API key)</title><body>HTTP 401: Unauthorized (Invalid API key)</body></html>
Jan 13 08:55:16 ign-rpt01 extractd[10947]: Error sending processes data: Stackdriver gateway replied with a 401: <html><title>HTTP 401: Unauthorized (Invalid API key)</title><body>HTTP 401: Unauthorized (Invalid API key)</body></html>

When I go to the Stackdriver account settings, I see this notice:
The following instances are using a deprecated configuration of the monitoring agent. Alerting policies referencing metrics from these agents do not work as intended and are currently unsupported. Dashboards using metrics from these agents are also unsupported and will soon stop working.
Please update your monitoring agent. Learn more
Okay, it turns out that only --write-gcm is supported now.
TL;DR version
Just run this:
curl -O "https://repo.stackdriver.com/stack-install.sh"
sudo bash stack-install.sh --write-gcm
And hey, my stats are starting to come in again.
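To confirm that the reinstall took effect, one option is to scan recent log output for the two failure messages quoted in the question. This is a minimal sketch, assuming the log lines have been saved to a file; the function name `agent_log_failures` is made up for illustration and is not part of the Stackdriver tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Stackdriver agent): print how many
# lines of a saved log file contain one of the two failure messages
# quoted in the question.
agent_log_failures() {
    local logfile="$1"
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status clean.
    grep -c -E 'Unable to determine collectd endpoint|HTTP 401: Unauthorized' "$logfile" || true
}
```

For example, after saving recent entries with `journalctl --since "10 minutes ago" > /tmp/agent.log`, running `agent_log_failures /tmp/agent.log` should print 0 once the agent is reporting again.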

Related

gce_instance monitoring and logging Disk Usage in bytes works for a few minutes then breaks

I followed the tutorials and successfully installed the monitoring and logging agents on my Debian 9 machine. All statuses are OK.
In Metrics Explorer, the gce_instance "Disk Usage in bytes" metric works for a few minutes and then breaks. I get the following error on my machine:
Aug 04 15:43:23 master collectd[13129]: write_gcm: Unsuccessful HTTP request 400: {
"error": {
"code": 400,
"message": "Field timeSeries[2].points[0].interval.start_time had an invalid value of \"2020-08-04T07:43:22.681979-07:00\": The start time must be before the end time (2020-08-04T07:43:22.681979-07:00) for the non-gauge metric 'agent.googleapis.com/agent/api_request_count'.",
"status": "INVALID_ARGUMENT"
}
}
Aug 04 15:43:23 master collectd[13129]: write_gcm: Error talking to the endpoint.
Aug 04 15:43:23 master collectd[13129]: write_gcm: wg_transmit_unique_segment failed.
Aug 04 15:43:23 master collectd[13129]: write_gcm: wg_transmit_unique_segments failed. Flushing.
EDITED
For anyone experiencing these issues: it is now a confirmed bug.
I filed a ticket in the Google issue tracker.
These error messages are harmless: you are not losing any metrics that were being recorded before, so you can safely ignore them.
The root cause is a server-side config change that affects all agents. That change only affected the verbosity of the responses, not the processing of the requests. Some of the incoming metrics were silently dropped before that change, and are now dropped noisily.
There is an issue tracker entry where you can see more details about the issue affecting you.

SendGrid misconfiguration on Google Cloud (535 Authentication failed)

I've set up SendGrid on a Google Compute Engine instance with a CentOS base image, following the documented instructions from Google:
https://cloud.google.com/compute/docs/tutorials/sending-mail/using-sendgrid#before-you-begin
I ran this test from the command line (with various accounts):
echo 'MESSAGE' | mail -s 'SUBJECT' GJ******#gmail.com
/var/log/maillog then shows lines like the following, around 50 attempts within one second:
postfix/error[32324]: A293210062D7: to=<GJ********#gmail.com>, relay=none, delay=145998, delays=145997/1.2/0/0, dsn=4.0.0, status=deferred (delivery temporarily suspended: SASL authentication failed; server smtp.sendgrid.net[167.89.115.53] said: 535 Authentication failed: The provided authorization grant is invalid, expired, or revoked)
The message is queued and retried every few hours. While experimenting, I changed the port setting from 2525 to one of the regular ports that isn't blocked by Google, and the email was bounced right away to the user account used in the test message.
I made sure to use the generated API key. The SendGrid dashboard shows no attempted, bounced, or otherwise processed messages.
There were other errors in the maillog, pages of them since it retried every second. They stopped after I changed the permissions on that directory, but maybe they give a clue as to how it's misconfigured?
Oct 31 19:04:14 beadc postfix/pickup[15119]: fatal: chdir("/var/spool/postfix"): Permission denied
Oct 31 19:04:15 beadc postfix/master[1264]: warning: process /usr/libexec/postfix/qmgr pid 15118 exit status 1
Oct 31 19:04:15 beadc postfix/master[1264]: warning: /usr/libexec/postfix/qmgr: bad command startup -- throttling
Oct 31 19:04:15 beadc postfix/master[1264]: warning: process /usr/libexec/postfix/pickup pid 15119 exit status 1
Oct 31 19:04:15 beadc postfix/master[1264]: warning: /usr/libexec/postfix/pickup: bad command startup -- throttling
The only information I can find about this error says it indicates a SendGrid misconfiguration.
Any ideas as to what the misconfiguration might be?
I've determined the 535 error was a port/firewall issue, which means the 550 error I had on the other port still exists.
If you see the 535 error, check your firewall settings:
https://cloud.google.com/compute/docs/tutorials/sending-mail/
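One way to check whether a given outbound mail port is reachable from the instance is to attempt a plain TCP connection. The following is a rough sketch, assuming bash with `/dev/tcp` support and the coreutils `timeout` command are available; the function name `port_reachable` is invented here. It only tests connectivity and says nothing about whether the SendGrid credentials are valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within the timeout (in seconds, default 5).
# Assumes bash was built with /dev/tcp redirection support.
port_reachable() {
    local host="$1" port="$2" secs="${3:-5}"
    # Inside the child shell, $0 is the host and $1 is the port.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `port_reachable smtp.sendgrid.net 2525 && echo open || echo blocked` reports whether the port from the tutorial can be reached from the instance.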

Stackdriver Monitoring with full access scope not authorized

After deploying a brand-new Google Compute Engine instance with full API access and installing the Stackdriver agent, Monitoring does not show any metrics from the agent.
According to the Install Agent manual, no further settings (such as manually configuring an API key) should be required.
The agent service status also shows the following error:
$ systemctl status stackdriver-agent
Jul 13 10:14:00 host stackdriver-agent[21203]: [ OK ]
Jul 13 10:14:00 host systemd[1]: Started LSB: start and stop Stackdriver Agent.
Jul 13 10:14:00 host collectd[21226]: Initialization complete, entering read-loop.
Jul 13 10:14:00 host collectd[21226]: match_throttle_metadata_keys: 1 history entries, 1 distinct keys, 46 bytes server memory.
Jul 13 10:14:00 host collectd[21226]: tcpconns plugin: Reading from netlink succeeded. Will use the netlink method from now on.
Jul 13 10:14:00 host collectd[21226]: write_gcm: Asking metadata server for auth token
Jul 13 10:14:01 host collectd[21226]: write_gcm: Unsuccessful HTTP request 403: {
"error": {
"code": 403,...
Jul 13 10:14:01 host collectd[21226]: write_gcm: Error talking to the endpoint.
Jul 13 10:14:01 host collectd[21226]: write_gcm: wg_transmit_unique_segment failed.
Jul 13 10:14:01 host collectd[21226]: write_gcm: wg_transmit_unique_segments failed. Flushing.
The Google Cloud Console shows the instance as having:
Cloud API access scopes
This instance has full API access to all Google Cloud services.
and running the following command inside the instance shows:
$ curl --silent -f -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/scopes
https://www.googleapis.com/auth/cloud-platform
Any thoughts on what is going wrong?
I figured it out:
You have to enable the Google Monitoring API in the API Manager; it is not enabled by default. There is no need to specify an API key, because the application default credentials are picked up.
Interestingly, I have two projects that have used Stackdriver Monitoring since early this year, and those do not require the Google Monitoring API to be enabled.
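To separate the two possible causes of a 403 (missing access scope versus disabled API), the scope list returned by the metadata server can be checked first. This is a small sketch, assuming the scope URLs arrive one per line as in the curl output above; the function name `has_monitoring_scope` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the instance's scope list (one URL per line)
# on stdin and succeed if any scope allows writing monitoring data.
has_monitoring_scope() {
    # -x matches whole lines only, -q suppresses output.
    grep -q -x -E 'https://www\.googleapis\.com/auth/(cloud-platform|monitoring|monitoring\.write)'
}
```

Piping the output of the curl command from the question into `has_monitoring_scope` should succeed on this instance. If it does and the 403 persists, the API itself is the likely culprit; with a current, authorized gcloud CLI (an assumption, since the answer used the API Manager page), `gcloud services enable monitoring.googleapis.com` is the command-line way to enable it.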

Not able to start node server

I get the following error when I try to start the server manually or from the command prompt.
Stopping NodeJS cartridge
Sun Dec 20 2015 10:29:20 GMT-0500 (EST): Stopping application 'nodejs' ...
Sun Dec 20 2015 10:29:20 GMT-0500 (EST): Stopped Node application 'nodejs'
Starting NodeJS cartridge
Sun Dec 20 2015 10:29:21 GMT-0500 (EST): Starting application 'nodejs' ...
Waiting for application port (8080) become available ...
Application 'nodejs' failed to start (port 8080 not available)
Failed to execute: 'control restart' for /var/lib/openshift/5671bca50c1e66a111000114/nodejs
Can anyone suggest a solution? Thanks in advance.

Message Archive Management Plugin (Prosody) can't open archive

I'm trying to get Message Archive Management (mam) working on a Prosody server.
I tried it with SQLite3, MySQL and PostgreSQL.
The log always shows this:
Oct 20 14:56:21 general info Hello and welcome to Prosody version 0.9.7
Oct 20 14:56:21 general info Prosody is using the epoll backend for connecti$
Oct 20 14:56:22 localhost:mam error Could not open archive storage
The archive exists in /var/lib/prosody/.