Set stackdriver alerts for specific error messages - google-cloud-functions

I cannot find a clean way to set Stackdriver alert notifications on errors in Cloud Functions
I am using a cloud function to process data to cloud data store. There are 2 types of errors that I want to be alerted on:
Technical exceptions which might cause function to 'crash'
Custom errors that we are logging from the cloud function
I have done the following:
Created a log metric searching for specific errors (although this will not work for 'crash' as the error message can be different each time)
Created an alert for this metric in Stackdriver monitoring with parameters as in below code section
This is done as per the answer to the question,
how to create alert per error in stackdriver
For the first trigger of the condition I receive an email. However, on subsequent triggers, let's say on the next day, I don't. Also, the incident stays in the 'opened' state.
Resource type: cloud function
Metric: from point 2 above
Aggregation: Aligner: count, Reducer: none, Alignment period: 1m
Configuration: Condition triggers if: any time series violates, Condition: is above, Threshold: 0.001, For: 1 min
So I have 3 questions:
Is this the right way to satisfy my requirement of creating alerts?
How can I still receive alert notifications for subsequent errors?
How can I set the incident to 'resolved', either automatically or manually?

I was having a similar problem and managed to at least get a mail every time. The "trick" seems to be to use sum instead of count as the aligner, in combination with a "most recent value" condition.
This causes Stackdriver to send a mail every time a matching log entry is found and to close the incident a minute later.
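For reference, here is a rough sketch of creating a policy along those lines with the @google-cloud/monitoring Node.js client, written in TypeScript. The metric name is a placeholder, and the ALIGN_SUM / zero-duration settings are my reading of the "sum" and "most recent value" advice above, so treat it as a starting point rather than a known-good configuration:

// Sketch: create an alerting policy on a log-based metric using ALIGN_SUM,
// assuming the @google-cloud/monitoring Node.js client library.
// The metric name below is a placeholder.
import * as monitoring from '@google-cloud/monitoring';

async function createErrorAlertPolicy(projectId: string) {
  const client = new monitoring.AlertPolicyServiceClient();

  const [policy] = await client.createAlertPolicy({
    name: client.projectPath(projectId),
    alertPolicy: {
      displayName: 'Cloud Function error alert (sketch)',
      combiner: 'OR',
      conditions: [{
        displayName: 'Any error logged in the last minute',
        conditionThreshold: {
          // Filter on the user-defined log-based metric (placeholder name).
          filter:
            'metric.type="logging.googleapis.com/user/my_error_metric" ' +
            'AND resource.type="cloud_function"',
          aggregations: [{
            alignmentPeriod: { seconds: 60 },
            perSeriesAligner: 'ALIGN_SUM',   // sum instead of count
          }],
          comparison: 'COMPARISON_GT',
          thresholdValue: 0,
          duration: { seconds: 0 },          // evaluate the most recent value
        },
      }],
    },
  });
  console.log(`Created policy ${policy.name}`);
}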

Normally, alerts resolve themselves once the alerting policy stops firing. The problem you're having with your alerts not resolving is because your metric only writes non-zero points - if there are no errors, it doesn't write zero. That means that the policy never gets an unambiguous signal that everything is fine, so the alerts just sit there (they'll automatically close after 7 days, but I imagine that's not all that useful for you).
This is a common problem and it's a tricky one to solve. One possibility is to write your policy as a ratio of errors to something non-zero, like request count. As long as the request count is non-zero, the ratio will compute zero if there are no errors, and so an alert on the ratio will automatically resolve. You need to be a bit careful about rounding errors, though - if your request count is high enough, you might potentially miss a single error because the ratio could round to zero.
Aaron Sher, Stackdriver engineer
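If you want to try the ratio idea through the API, the v3 threshold condition supports a denominator filter and aggregations. A hedged sketch of such a condition follows; it could replace the condition in the sketch above. The error metric name is a placeholder, while cloudfunctions.googleapis.com/function/execution_count is the built-in execution count metric:

// Sketch of a ratio-style condition (errors / executions), assuming the
// Cloud Monitoring v3 MetricThreshold denominator fields.
const ratioCondition = {
  displayName: 'Error ratio above zero',
  conditionThreshold: {
    // Numerator: the user-defined log-based error metric (placeholder name).
    filter:
      'metric.type="logging.googleapis.com/user/my_error_metric" ' +
      'AND resource.type="cloud_function"',
    aggregations: [{ alignmentPeriod: { seconds: 300 }, perSeriesAligner: 'ALIGN_SUM' }],
    // Denominator: total function executions over the same window.
    denominatorFilter:
      'metric.type="cloudfunctions.googleapis.com/function/execution_count" ' +
      'AND resource.type="cloud_function"',
    denominatorAggregations: [{ alignmentPeriod: { seconds: 300 }, perSeriesAligner: 'ALIGN_SUM' }],
    comparison: 'COMPARISON_GT',
    thresholdValue: 0,
    duration: { seconds: 0 },
  },
};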

We got around this issue by adding the insertId as a label on the log-based metric we create for every log record we get from the pods running our services.
In the alerting policy, this label helped in two things:
We grouped by it (named record_id), which made each incident unique, so it got reported without waiting for other incidents to be resolved, and at the same time it got resolved instantly.
We used it in the documentation of the notification to include a direct link to the issue (the log record) itself, which was a nice and essential feature to have. https://console.cloud.google.com/logs/viewer?project=MY_PROJECT&advancedFilter=insertId%3D%22${metric.label.record_id}%22
As Aaron Sher mentioned in his answer, it is a tricky problem. We might have done something not recommended or not efficient, but it works fine, and of course we are open to improvement recommendations.
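For illustration, here is a hedged sketch of creating such a log-based metric with an insertId label extractor by calling the Cloud Logging REST API directly (projects.metrics.create). The metric name, the filter and the token handling are assumptions, not taken from the original setup:

// Sketch: create a log-based metric that extracts insertId into a label.
// Assumes an OAuth access token is available; the filter is a placeholder.
async function createRecordIdMetric(projectId: string, accessToken: string) {
  const body = {
    name: 'my_error_metric',
    description: 'Errors labelled with the originating log entry insertId',
    filter: 'resource.type="k8s_container" AND severity>=ERROR',   // placeholder filter
    metricDescriptor: {
      metricKind: 'DELTA',
      valueType: 'INT64',
      labels: [{ key: 'record_id', valueType: 'STRING' }],
    },
    // Populate the label from the log entry's insertId.
    labelExtractors: { record_id: 'EXTRACT(insertId)' },
  };

  const res = await fetch(
    `https://logging.googleapis.com/v2/projects/${projectId}/metrics`,
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${accessToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  );
  if (!res.ok) throw new Error(`Metric creation failed: ${res.status}`);
}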

Related

Handling Pubsub messages in Google Functions

I'm learning about the Google cloud functions and I'm setting them up to be triggered by the messages placed in the queue. I think I'm really failing to grasp some concepts here as I have a bunch of questions and can't find answers anywhere. There are a lot of examples explaining functions and clients, but I haven't found examples merging the two.
Functions get triggered by the topic and not by the subscription. This one is weird, because a single topic can have multiple subscriptions and even multiple subscribers per subscription, which would mean the function doesn't acknowledge the messages, as it doesn't know which message to acknowledge.
Building on the first question, when a message arrives on the topic, do all the subscriber functions get executed? What about the functions that are in the process of doing some work? What about multiple subscribers on a single subscription?
Can a real pull subscription then even be implemented in a function? That would mean the function runs constantly because of the need to pull the items, which is costly and the wrong thing to do.
Can a message be nacked from the function? It seems the functions are retried only if they are deployed with retries enabled, but then they rerun the function immediately and for as long as the retry period is set (the default is 7 days), which can cause extreme costs if a function is buggy, and is a totally crap pattern.
All of this makes me think that:
It would be a much better implementation to trigger functions from subscriptions, and for subscriptions to be able to ack / nack messages, than listening to topics
I should choose push subscriptions alongside HTTP functions, which seem much more controllable (I might be wrong, haven't tried it)
Can anyone shed some light on this? Can I control the messages easily from the function and can I expect the function to be rerun if a message is nacked or resent?
Perhaps the key piece of information is that when you hook a Cloud Pub/Sub topic to a Cloud Function, a push subscription is created by the system in order to send messages to that Cloud Function.
Every cloud function you tie to a topic will have its own subscription and will receive all messages published to the topic. If an instance of the function is already doing work, then another instance could be created to handle the load (or will just be load balanced among instances that are already running). Push subscriptions don't really have a notion of multiple subscribers for the same subscription. From Cloud Pub/Sub's perspective, there is a single endpoint to which to push messages. Cloud Functions receives those messages and distributes them among instances of your Function that the service is running.
It would be very tough to implement a pull subscription as a Cloud Function. You would need a trigger to start the Function and it would have to do all of its work in the time allotted for it to run.
It sounds like you want to nack with a backoff on retrying the message. That is not a feature supported currently, but we are aware of the limitation and are looking to make improvements here soon.
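To make the acknowledgement behaviour concrete, here is a minimal sketch of a Pub/Sub-triggered background function for the Node.js runtime, written in TypeScript. The export name, payload shape and logging are assumptions for illustration:

// Sketch of a Pub/Sub-triggered background Cloud Function.
// Returning normally acknowledges the message; throwing (with retries
// enabled at deploy time) causes Pub/Sub to redeliver it.
interface PubSubMessage {
  data?: string;                       // base64-encoded payload
  attributes?: Record<string, string>;
}

export const handleMessage = async (message: PubSubMessage, context: { eventId: string }) => {
  const payload = message.data
    ? Buffer.from(message.data, 'base64').toString('utf8')
    : '';
  console.log(`Event ${context.eventId}: ${payload}`);

  // Throwing here is the closest thing to a "nack": with retries enabled the
  // message is redelivered, otherwise it is not retried.
  // throw new Error('processing failed');
};

Deployed with something like gcloud functions deploy handleMessage --trigger-topic my-topic --retry, a thrown error causes the message to be redelivered; without --retry it is not retried.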

Keep getting 429 (Too Many Requests) throttling errors

I tried to engage with the API team via Twitter but I've not had a response and development is grinding to a halt here...
In short, I keep getting a 429 when developing against the OneNote API. I know this suggests I'm hitting the API too hard, but I'm not.
At worst I'm doing maybe 1 or 2 requests per minute, manually invoked by me as I develop. Sometimes I'll leave it 10-15 minutes between calls, sometimes this works, sometimes not.
I've been working on a particular problem the last few days.
In my code I make a call to get all notebooks, sections & section groups in a single query (filtered to only return data from certain notebooks)
I then make a second call to get all updated pages for those notebooks. I've been fiddling with the filter string to get this second call working (which I now think I have), but 9 times out of 10 I get the 429 on this second API call.
Is there some way of getting my user account whitelisted please?
FWIW this is my second query (the spaces normally get encoded):
/me/notes/pages?count=true&top=100&expand=parentNotebook,parentSection
&filter=(parentNotebook/id eq '{GUID}' or parentNotebook/id eq '{GUID}' or parentNotebook/id eq '{GUID}') and lastModifiedTime gt 2016-08-05T11:34:09.000Z
This does work as I'd expect, the date clause is working now, but I can only test very occasionally as I get the 429.
Incidentally, if I run my second filter through the API console I get a 504 "Proxy request timeout" every time. This has been the case since I added the parentheses around the notebook predicate.
So I'm pretty much unable to continue development, how do I resolve this please?
As a short term workaround, please try the following:
Instead of one query:
/me/notes/pages?count=true&top=100&expand=parentNotebook,parentSection&filter=(parentNotebook/id eq '{GUID}' or parentNotebook/id eq '{GUID}' or parentNotebook/id eq '{GUID}') and lastModifiedTime gt 2016-08-05T11:34:09.000Z
Remove the "count=true" (are you using this?) and leave only one parentNotebookId filter. The results will be by default ordered by LastModifiedTime descending (most recent first).
Perform this query for all the notebooks you're interested in:
/me/notes/pages?top=100&expand=parentNotebook,parentSection&filter=parentNotebook/id eq '{GUID}'
Just a hunch: can you try splitting the second call (for getting updated pages) into separate http requests (one for each notebook id)?
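For illustration, here is a hedged TypeScript sketch of that splitting approach: one request per notebook id, without count=true. The base URL and query parameter names follow the question's /me/notes/pages query; the token handling and notebook ids are placeholders:

// Sketch: one GET Pages request per notebook, results ordered by
// lastModifiedTime descending by default.
async function getRecentPages(accessToken: string, notebookIds: string[]) {
  const base = 'https://www.onenote.com/api/v1.0/me/notes/pages';
  const results: any[] = [];

  for (const id of notebookIds) {
    const params = new URLSearchParams({
      top: '100',
      expand: 'parentNotebook,parentSection',
      filter: `parentNotebook/id eq '${id}'`,
    });
    const res = await fetch(`${base}?${params}`, {
      headers: { Authorization: `Bearer ${accessToken}` },
    });
    if (!res.ok) throw new Error(`GET pages failed: ${res.status}`);
    const body = await res.json();
    results.push(...body.value);   // OData collections return items in "value"
  }
  return results;
}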
Also if what you want are update notifications, webhooks might be the better way to go.
Lastly apologies for the silence on Twitter.
Unfortunately you're hitting a bad bug in our GET Pages API (when called with filters and additional query params). Effectively, we are doing a crappy job of applying the filters when calling our indexing service (which in turn is throttling us). We've identified this as an ongoing problem which starts throttling callers, especially under heavy load.
Short-term workaround: we're fiddling with our capacity and upping the throttling limits set by our partner indexing service temporarily while we work on the longer-term fix. Hopefully this will result in fewer 429s for you going forward.
As a future-proof resolution, I would also encourage you to look into Jorge's suggested answer (remove the count=true query param and filter only on one parentNotebookId, with no lastModifiedTime filtering).

Remove expired data from Superfeedr

We use Superfeedr to load current internal job postings from our cloud recruitment software (Newton). It was just brought to my attention yesterday that positions that are no longer active are still being loaded in our feed.
The raw feed from Newton is correct and it does not have the non-active positions listed. When I access the feed via Superfeedr the inactive positions are being returned. When looking at their documentation, it sounds like this behavior may be by design.
Does anyone know if I am correct or if there is a workaround for this?
Updates
If entries are updated, the updates are not propagated by default. This is because we want to avoid creating numerous false positives. There is, however, one exception: if a new entry contains a valid, updated element, and the update is very recent (within three minutes), you will receive a notification. This means that we will usually propagate updates for feeds if we receive a ping from the publisher.
https://documentation.superfeedr.com/subscribers.html

Failures in eventual consistent system and user experience [duplicate]

When using distributed and scalable architecture, eventual consistency is often a requirement.
In the GUI, how do you deal with this eventual consistency?
Users are used to clicking save and seeing the result instantaneously... with eventual consistency that's not possible.
How to deal with the GUI for such scenarios?
Please note the question applies both for desktop applications and web applications.
PS: I'm working with the Microsoft platform, but I imagine the question applies to any technology...
A task-based UI fits this model well. You create and execute tasks from the UI. You can also have something like a task status monitor to show the user when a task has executed.
Another option is to use some kind of polling from the client. You send the command, and poll from the client until the command has completed and the new data is available. You will have a delay in some cases from when the user presses save to when they see the new record, but in most cases it should be almost synchronous.
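A minimal sketch of that polling approach in TypeScript; the /api/commands endpoints and the status field names are invented for illustration:

// Sketch: issue a command, then poll until the read model reports it done.
async function saveAndWait(command: object): Promise<void> {
  // Issue the command; the server replies with an id to poll on.
  const res = await fetch('/api/commands', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(command),
  });
  const { commandId } = await res.json();

  // Poll until the command is processed, or give up after a fixed budget.
  for (let attempt = 0; attempt < 20; attempt++) {
    const status = await fetch(`/api/commands/${commandId}`).then(r => r.json());
    if (status.state === 'completed') return;
    if (status.state === 'failed') throw new Error(status.reason);
    await new Promise(resolve => setTimeout(resolve, 500));   // wait before next poll
  }
  throw new Error('Timed out waiting for the command to complete');
}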
Another (good?) option is to assume/design commands that don't fail. This is not trivial, but you can have a cache on the client, add the data from the command to that cache and display it to the user even before the command has been executed. If the command fails for some unexpected reason, well, then just design a good "we are sorry" message for having misled the user for a few seconds.
You can also combine the methods above.
Usually eventual consistency is more of a business/domain problem, and you should have your domain experts handle it.
I think that other answers mix together CQRS in general and eventual consistency in particular. A task-based UI is very suitable for CQRS, but it does not resolve the issue with an eventually consistent read model.
First, I would like to challenge your statement:
Users are used to clicking save and seeing the result instantaneously... with eventual consistency that's not possible.
What do you mean by this? Why is it not possible to see the result immediately? I think the issue here is your definition of result.
The result of any action is that that action has been performed. There are numerous ways to show this! It depends on what kind of action you want to complete. Examples:
Send an email: if the user has entered a correct email address, it is almost guaranteed that the action will complete successfully. To prevent unexpected failures one might use durable queues, since this kind of action does not need to be done synchronously. So you just say "email sent". Typically you see this kind of response when you ask to reset your password.
Update some information in a user profile: after you have validated the new data on the client, most probably the command will succeed too, since the only thing that could go wrong is a database error (if you use a database). Again, even this can be mitigated by using durable queues. In this case you just show the updated field in the same form. Good practice for an SPA is to have a comprehensive data store on the client side, like Redux does. In this case you can safely update the server by sending a command and also update the client-side store, which results in the UI showing the latest data. Disclaimer: some answers refer to this technique as "tricking the user", but I disagree with this definition.
If you have commands that are prone to error, you can use techniques that are already described in other answers, like WebSockets or Server-Sent Events, to communicate errors back. This requires quite a lot of additional work. You can also send a command and wait for a reply, or execute commands synchronously. Some would say "this is not CQRS", but that would be just another dogma to be challenged. Ensuring the command has completed execution, in combination with the previous point (a client-side data store), will be a good solution.
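As a rough illustration of the client-side store idea from the profile-update example above, here is a hedged TypeScript sketch; the store shape and the endpoint are made up:

// Sketch: update the local store optimistically while the command is sent,
// so the UI shows the new value immediately, and roll back on failure.
interface ProfileState { displayName: string; pendingError?: string; }

let store: ProfileState = { displayName: 'old name' };

async function updateDisplayName(newName: string) {
  const previous = store.displayName;
  store = { ...store, displayName: newName };   // optimistic local update

  try {
    const res = await fetch('/api/profile/display-name', {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ displayName: newName }),
    });
    if (!res.ok) throw new Error(`Command rejected: ${res.status}`);
  } catch (err) {
    // Roll back and surface the failure instead of leaving stale optimism.
    store = { ...store, displayName: previous, pendingError: String(err) };
  }
}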
I am not sure there is any 100% bulletproof technique that allows you to always show non-stale data from the read model. I think it goes against the principles of CQRS. Even with real-time events you will only get events that indicate that your write model has been updated. Still, your projections could have failed, and reacting to that is a whole other story.
However, I would not concentrate that much on this issue. The fact is that well-tested projections and almost-guaranteed commands will work very well. For error handling, in 90% of situations it is enough to have some manual or half-manual process to recover from those errors. For the last 10% you can combine generic "error" messages pushed from the server saying "sorry, your action XXX failed to execute", and the top-priority actions could have some creative process behind them, but in reality those situations would be very, very rare.
There are 2 ways:
Trick the user (just show that things have happened when they really haven't happened yet)
Show that the system is processing the request, and use polling in the background (not good) or just a timer set to the value of your SLA.
I prefer the 1st option.
As someone has already mentioned, task-based UIs fit well here, and what I would do is employ a technique that 'buys you time' for the command to propagate.
For example, imagine we are on a list screen where the user can perform various actions, one of which is adding a new item to the list. After choosing to add an item, you could display a "What would you like to do next?" screen, which could offer 'Add another item', 'Do this task', 'Do some other task', 'Go back to list'.
By the time they have clicked on an option, the data would have hopefully been refreshed.
Also, if you're using a task-based UI, you can analyse the patterns of task execution and use these "what would you like to do next" screens to streamline the UI, similar to Amazon's "other people also bought these items".
As previously stated, it is fine to tell the user that the request (command) has been acknowledged (successfully issued). In case of some failure, the system should communicate this to the requester, by means of:
email;
SMS;
custom inbox (e.g. like the SO inbox);
whatever.
E.g., mail client / service:
I am sending a mail to a wrong address;
the mail service says: "email sent successfully :)";
after a few minutes, I receive a mail from the service: "email could not be delivered".
I believe a great way to inform the user about a recent failure is to present an error panel while they're navigating through the application. A user gesture might be required in order to dismiss that alert, etc.
I wouldn't go with tricking the user or blocking them from committing other actions. I would rather go for streaming data toward the UI after it has been acknowledged by the read side. Let's consider these two cases:
The user saves data and expects a result. A connection is established to the server. After the data is acknowledged by the read side, it is streamed toward the UI and the UI is updated.
The user saves data and refreshes the web page. Upon reload, data is fetched from the data store and the streaming connection is established. If the read side hasn't updated the data store in the meantime, there is still an open stream and the UI is updated once the data reaches the read side.
Why stream from the read side and not directly from the write side? Simply because that serves as confirmation that the read side has been reached.
From a technical standpoint, Server-Sent Events could be used.
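A minimal sketch of the browser side of that streaming approach using Server-Sent Events; the endpoint and event name are placeholders:

// Sketch: subscribe the UI to read-side confirmations over SSE.
const events = new EventSource('/api/read-model/stream');

events.addEventListener('projection-updated', (e) => {
  const update = JSON.parse((e as MessageEvent).data);
  // Only now, after the read side has acknowledged the change,
  // refresh the relevant part of the UI.
  console.log('Read model updated:', update);
});

events.onerror = () => {
  // The browser retries automatically; optionally show a "reconnecting" hint.
  console.warn('SSE connection lost, retrying...');
};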
Disadvantage:
Results will still not be reflected immediately by the read side. But at least, in most cases, the user will be able to continue working without being blocked by the UI.
There are several ways to handle eventual consistency. All of them really exist to occupy the time from the user's action until the backend refresh.
User reads: a given user only reads from the same database node that they write to. Other users read from the replicated nodes. PROS: the UI is quick enough, and the application stays in sync. CONS: your service architecture has to track and route users to specific database nodes.
Disable the UI until the action has completed, and refresh it. JavaServer Faces has a classic example of this: one could create a modal with a loading spinner to cover the UI until the refresh completes. PROS: the UI stays in sync with application state. CONS: almost every action creates a blocked UI. Users get very frustrated by the restricted UI and will complain of application slowness.
Confirmation: immediately thank the user for their submission, then let them know later (email, SMS, in-app notification) whether or not the action completed. PROS: it's fast up front. CONS: the UI lags behind the system until the refresh. Even with a notice, the user may be confused that they don't see the updates. It also requires integration of various communication channels. Users won't see their changes right away, and if the action fails, they may not know until it's too late.
Fake it: optimistically assume that the action will complete. Show the user the resulting UI (upvote, comment, credit card confirmation, etc.) and allow them to continue as if it succeeded. If there are failures, immediately show them as contextual errors: alerts next to the undone upvotes, an in-app alert on the post with the failed comment, an email for the declined credit card. PROS: the UI feels much faster. CONS: the UI is temporarily out of sync with application state, and you must resolve that. One case: you might fake creation of content with temp IDs, but after the content is created, the temp IDs will be wrong until the refresh. Second case: you might need to store all state changes on the UI after the action until the refresh, and then you need some resolver to apply all the local state changes since the action was issued. This resolution is non-trivial. (A small sketch of the temp-ID reconciliation follows this list.)
Web Sockets: subscribe the UI to an event stream so that when the action is completed on the backend, it is pushed to the front end. Is it one-way or two-way streaming? PROS: the UI feels fast and it's in sync with the application state. CONS: consistent browser support, the need for a backend source of streaming events, and socket server scalability.
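As a rough sketch of the "Fake it" temp-ID reconciliation mentioned above, in TypeScript; the endpoint and item shape are invented for illustration:

// Sketch: create content with a temporary id, show it immediately, then
// swap in the real id (or mark the item failed) once the backend confirms.
interface CommentItem { id: string; text: string; failed?: boolean; }

const comments: CommentItem[] = [];

async function addComment(text: string) {
  const tempId = `temp-${Date.now()}`;
  comments.push({ id: tempId, text });          // optimistic insert with temp id

  try {
    const res = await fetch('/api/comments', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text }),
    });
    if (!res.ok) throw new Error(`rejected: ${res.status}`);
    const { id } = await res.json();
    const item = comments.find(c => c.id === tempId);
    if (item) item.id = id;                     // resolve temp id to the real one
  } catch {
    const item = comments.find(c => c.id === tempId);
    if (item) item.failed = true;               // show a contextual error next to the item
  }
}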

catching Exceeded maximum execution time?

Is there any way of catching this GAS error: "Exceeded maximum execution time"?
I mean catching with try ... catch(e) // so far it's not working for me.
Thanks
As written in the comments to your question, that's not possible. However, you can set a flag in ScriptDB or properties when execution starts and clear that flag when execution comes to a normal end. That way you can find out during the next run whether your script came to a regular end the last time it ran, and try to take corrective action if not.
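A hedged sketch of that flag idea using script properties (ScriptDB has since been deprecated), written as TypeScript-flavoured Apps Script; the property name and the corrective action are placeholders:

// Sketch: record a "run in progress" flag that only gets cleared on a
// normal end, so the next run can detect a timeout.
// PropertiesService is an Apps Script global.
declare const PropertiesService: any;

function didLastRunFinish(): boolean {
  const props = PropertiesService.getScriptProperties();
  return props.getProperty('RUN_IN_PROGRESS') !== 'true';
}

function mainTask(): void {
  const props = PropertiesService.getScriptProperties();
  if (!didLastRunFinish()) {
    // The previous run never cleared its flag, so it probably timed out.
    // Take corrective action here (e.g. resume from a saved checkpoint).
  }
  props.setProperty('RUN_IN_PROGRESS', 'true');

  // ... do the actual work ...

  props.deleteProperty('RUN_IN_PROGRESS');   // only reached on a normal end
}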
The answer above is correct; it's not possible. An easy alternative to the workaround that pbhd mentioned would be to simply track the runtime of your script (e.g. comparing results of new Date().getTime() at regular intervals) and run whatever you'd include under your catch statement right before you hit the maximum execution time. The maximum is 6 minutes (reference).
That way, you don't have to catch the error -- you can preempt it.
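A hedged sketch of that pre-emption idea, again as TypeScript-flavoured Apps Script; the five-minute budget and the checkpoint property name are arbitrary choices for illustration:

// Sketch: track elapsed time and bail out cleanly before the 6-minute cap,
// saving a checkpoint so a later run can resume.
// PropertiesService is an Apps Script global.
declare const PropertiesService: any;

const MAX_RUNTIME_MS = 5 * 60 * 1000;   // stop well before the 6-minute limit

function processItems(items: string[]): void {
  const start = new Date().getTime();
  const props = PropertiesService.getScriptProperties();
  const resumeAt = Number(props.getProperty('NEXT_INDEX') || '0');

  for (let i = resumeAt; i < items.length; i++) {
    if (new Date().getTime() - start > MAX_RUNTIME_MS) {
      props.setProperty('NEXT_INDEX', String(i));   // checkpoint for the next run
      return;                                       // exit before the hard limit hits
    }
    // ... process items[i] ...
  }
  props.deleteProperty('NEXT_INDEX');               // finished everything
}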
During normal testing, it is possible to accidentally create an infinite (or very long running) loop that consumes the daily execution time limit 100%.
Even if you realize what you have done wrong immediately, you cannot immediately re-try with Google scripts for another 24 hours - thus slowing down ongoing development significantly and maybe forcing the developer to do some other work, taking his focus/"attention stream" away from the current problem. This is almost always bad.
My product ("IBM OLIVER CICS test/debug" - see Wikipedia article) solved this problem - and many others - around 37 years ago - by having a time limit on any particular transaction and intercepting the resulting time out, allowing options of:-
continuation or
examine/modify variables
"manual" re-try (for the same time) or
abort.
Google could implement this approach just as easily - by "pausing" if execution time is looking too heavy. I had a similar solution to other resources in OLIVER - such as excessive API calls ("possible macro loop") and excessive memory usage.
It seems it takes an "old timer" like me to provide solutions to problems that have existed "since the beginning of time" (and certainly before PC's were thought of).
Googles current "solution" (i.e. absolute limits) only helps Google keep its own servers from being swamped. It would be easy for them to do what OLIVER did all those years ago. By the way there should be no "IBM" prefix on the Wikipedia article - it was my own product and some clown Wikipedia editor altered it to include the prefix.
(By the way, Google does not prevent other scripts on the same spreadsheet from running - ones that maybe only use minimal amounts of extra time - i.e. scripts on the same spreadsheet still work. I tried renaming the original script as an experiment, but it was stopped after a very short time with an "exceeded execution time" error.)
GIZ-A-JOB Google - you know it's worth it!