Documents List 3.0 changes feed API calls seem to fail ~5% of the time - google-drive-api

I keep hitting this error a low though non-negligible percentage of the time (low single digits in my testing).
I am positive it isn't my code, since I was able to reproduce this with a test case that just repeats the same API call against the changes feed, without any changes to the data set used for my account.
<errors xmlns='http://schemas.google.com/g/2005'><error><domain>GData</domain><code>ServiceException</code><internalReason>An unknown error has occurred.</internalReason></error></errors>
How does one go about debugging this issue, given that there is no further detail in the error message?
Even though it only happens ~5% of the time, that is high enough for the failure to be noticeable.
I see very similar behavior with drive.changes.list(), which again seems to fail at a comparable rate.
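A minimal sketch of the kind of repro loop described above, assuming Apps Script with the Drive advanced service (Drive API v2) enabled; the failure counter and backoff are purely illustrative and not part of the original test case:

    function measureChangesListFailureRate() {
      // Assumes the Drive advanced service (Drive API v2) is enabled for the script.
      var attempts = 200;
      var failures = 0;
      for (var i = 0; i < attempts; i++) {
        try {
          // The same call every iteration; the account's data set is never modified.
          Drive.Changes.list({ maxResults: 10 });
        } catch (e) {
          failures++;
          Logger.log('Attempt ' + i + ' failed: ' + e);
          // Back off briefly before the next attempt (the usual advice for 5xx errors).
          Utilities.sleep(Math.min(Math.pow(2, failures), 32) * 100);
        }
      }
      Logger.log('Failure rate: ' + (100 * failures / attempts).toFixed(1) + '%');
    }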

Related

When running the DAML sandbox an error occurs

The following error occurs when running the sandbox:
io.grpc.netty.NettyServerHandler onStreamError
WARNING: Stream Error
io.netty.handler.codec.http2.Http2Exception$HeaderListSizeException: Header size exceeded max allowed size (8192)
What could the cause of this be?
I have seen this error numerous times, and it is a consequence of having a transaction failure in a complex DAML model/transaction when running on the Sandbox. When you experience a transaction failure (fetch/exercise of an inactive contract, lookupByKey returning a stale cid, head [], divide-by-zero, etc.) the sandbox helpfully tries to provide transaction trace information in the error result.
This is normally fine for relatively simple models. With more complex models this trace can exceed the maximum header size, producing the error you see. When this happens I have found the trace in the sandbox.log file, sometimes along with other errors that help explain what is going on.
The trace is an unformatted dump, so it can take a bit of effort to decode manually, but I have done it many times myself and the information I needed to identify the issue has always been there. To be honest, generally just knowing the choice I was exercising plus the specific class of error is normally enough to point me in the right direction.
I believe there is some tooling being built to help with this sort of diagnosis; however, I don't know how advanced the work on that is.

Set Stackdriver alerts for specific error messages

I cannot find a clean way to set up Stackdriver alert notifications for errors in Cloud Functions.
I am using a Cloud Function to process data into Cloud Datastore. There are two types of errors that I want to be alerted on:
Technical exceptions which might cause the function to 'crash'
Custom errors that we are logging from the Cloud Function
I have done the following:
Created a log-based metric searching for specific errors (although this will not work for 'crash', as the error message can be different each time)
Created an alert for this metric in Stackdriver Monitoring with the parameters shown below
This is done as per the answer to the question "how to create alert per error in stackdriver".
For the first trigger of the condition I receive an email. However, on subsequent triggers, say on the next day, I don't. Also, the incident remains in the 'opened' state.
Resource type: cloud function
Metric: from point 2 above
Aggregation: Aligner: count, Reducer: None, Alignment period: 1m
Configuration: Condition triggers if: Any time series violates, Condition: is above, Threshold: 0.001, For: 1 min
So I have three questions:
Is this the right way to satisfy my requirement of creating alerts?
How can I still receive alert notifications for subsequent errors?
How can the incident be set to 'resolved', either automatically or manually?
I was having a similar problem and managed to at least get a mail every time. The "trick" seems to be to use sum instead of count for the aligner, in combination with "most recent value" for the condition duration.
This causes Stackdriver to send a mail every time a matching log entry is found and to close the incident a minute later.
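For reference, a rough sketch of what that condition might look like expressed against the Cloud Monitoring v3 AlertPolicy API (the metric name user/my-error-metric is a placeholder; sum maps to ALIGN_SUM and "most recent value" roughly to a 0s duration). It could be created with something like gcloud alpha monitoring policies create --policy-from-file=policy.json:

    {
      "displayName": "Cloud Function error alert",
      "combiner": "OR",
      "conditions": [{
        "displayName": "Any error log entry",
        "conditionThreshold": {
          "filter": "metric.type=\"logging.googleapis.com/user/my-error-metric\" resource.type=\"cloud_function\"",
          "aggregations": [{
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_SUM"
          }],
          "comparison": "COMPARISON_GT",
          "thresholdValue": 0,
          "duration": "0s"
        }
      }]
    }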
Normally, alerts resolve themselves once the alerting policy stops firing. The problem you're having with your alerts not resolving is because your metric only writes non-zero points - if there are no errors, it doesn't write zero. That means that the policy never gets an unambiguous signal that everything is fine, so the alerts just sit there (they'll automatically close after 7 days, but I imagine that's not all that useful for you).
This is a common problem and it's a tricky one to solve. One possibility is to write your policy as a ratio of errors to something non-zero, like request count. As long as the request count is non-zero, the ratio will compute zero if there are no errors, and so an alert on the ratio will automatically resolve. You need to be a bit careful about rounding errors, though - if your request count is high enough, you might potentially miss a single error because the ratio could round to zero.
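As a hedged sketch of that ratio idea: the MetricThreshold condition in the v3 API accepts a denominatorFilter/denominatorAggregations pair, so errors can be divided by execution count (metric names here are placeholders and worth checking against the current API docs):

    {
      "conditionThreshold": {
        "filter": "metric.type=\"logging.googleapis.com/user/my-error-metric\" resource.type=\"cloud_function\"",
        "aggregations": [{ "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_SUM" }],
        "denominatorFilter": "metric.type=\"cloudfunctions.googleapis.com/function/execution_count\" resource.type=\"cloud_function\"",
        "denominatorAggregations": [{ "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_SUM" }],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.001,
        "duration": "0s"
      }
    }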
Aaron Sher, Stackdriver engineer
We got around this issue by having the insertId as a label of the log-based metric we created for every log record we get from the pods running our services.
In the alerting policy, this label helped in two things:
We grouped by it (named record_id), which made each incident unique, so it got reported without waiting for other incidents to be resolved, and at the same time it got resolved instantly.
We used it in the documentation of the notification to include a direct link to the issue (the log record) itself, which was a nice and essential feature to have: https://console.cloud.google.com/logs/viewer?project=MY_PROJECT&advancedFilter=insertId%3D%22${metric.label.record_id}%22
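A sketch of what such a log-based metric definition might look like as a Logging v2 LogMetric resource (the filter is a placeholder for whatever selects your error records, and the EXTRACT expression should be checked against the label-extractor documentation). In the alert policy's aggregation you would then group by metric.label.record_id so that each log record becomes its own time series:

    {
      "name": "my-error-metric",
      "description": "Error log records, labelled by insertId",
      "filter": "resource.type=\"k8s_container\" AND severity>=ERROR",
      "labelExtractors": {
        "record_id": "EXTRACT(insertId)"
      },
      "metricDescriptor": {
        "metricKind": "DELTA",
        "valueType": "INT64",
        "labels": [{ "key": "record_id", "valueType": "STRING" }]
      }
    }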
As @Aaron Sher mentioned in his answer, it is a tricky problem. We might have done something not recommended or not efficient, but it works fine, and of course we are open to recommendations for improvement.

Transaction's response time higher in Vugen than in browser

I am performance testing a map-based web application, where a query is fired at the DB and tabular data and a map are returned, sort of like Google Maps. The problem, I suppose, is with rendering. While the time taken to actually "see" the table in the browser is around 1 minute, the same transaction takes around 3 minutes to complete in VuGen.
Advanced tracing of the logs shows that a JSON response (for the above-mentioned table) is being downloaded. This response is about 6 MB and is what delays completion of the transaction. I have made sure that no asynchronous calls run alongside this GET call; it is this call, wrapped between lr_start_transaction and lr_end_transaction, that is causing the high response time.
I understand that we might capture client-side activity better using the TruClient protocol or others, but there is a restriction on that and the Web HTTP protocol needs to be used.
I am using HP LoadRunner 12.02, with the WinInet capture level.
My question is: is there any way I can actually emulate that "1 minute" that a user would need to "see" the tabular data, rather than the 3 minutes it is taking? It's okay if I disregard this JSON response and don't download the 6 MB of data, if that makes any difference.
Any suggestion would be much appreciated. Thanks!

What are examples of real-world scenarios where a message queuing system can accept the loss of some messages?

I was reading this blog post, in which the author proposes the following question, in the context of message queues:
does it matter if a message is lost? If your application node, processing the request, dies, can you recover? You’ll be surprised how often it doesn’t actually matter, and you can function properly without guaranteeing all messages are processed
At first I thought that the main point of handling messages was to never lose a single message - after all, a lost message could mean a hotel reservation not booked, a checkout not completed, or some other functionality not carried through, which seems too similar to a bug to me. I suppose I am missing something, so: what are examples of scenarios where it is OK for a messaging system to lose a few messages?
Well, your initial expectation:
the main point of handling messages was to never lose a single message
was just not a correct one.
Right, if one strives for a certain type of robustness, where fail-safe measures have to take all due care and precautions so that not a single message can get lost, then yes, your a priori expressed expectation fits.
This does not mean that all other system designs have to carry all the immense burdens and pay all the incurred costs (resources-wise, latency-wise, et al.) that the "100+% guaranteed delivery" systems do (but, again, only if they can).
Anti-pattern cases:
There are many use-cases, where an absolute certainty of delivery of each and every message originally sent is actually an anti-pattern.
Just imagine a weakly synchronised system (including ones that have nothing like back-throttling or even the simplest form of feedback propagation at all), where sensors read an actual temperature, a sound, or a video frame and send a message with that value(s).
Whenever a postprocessing system gets such information delivered, there may be a reason not to read any and all "old" values, but only the most recent one(s).
If the delivery framework has already received a newer set of values, then all the "older" values not yet processed, still hanging at some depth from the queue head, create the anti-pattern: one would not like to have to read and process all of those "older" values, only the most recent one(s).
Just as no one will make a trade with you based on yesterday's prices, there is no positive value in making any new, current decision based on reading all the "old" temperature readings that still wait in the queue.
Some smart-messaging frameworks provide explicit means for taking just the very "newest" message from a given source, thus making it possible to imperatively discard any "older" messages and avoid reading and processing them, precisely because a more recent one is known to be present.
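As a framework-agnostic illustration of that "take only the newest" idea, a consumer can simply drain whatever is queued and act only on the last reading (a plain in-memory sketch; a real smart-messaging framework would offer this as a built-in, e.g. a conflate-style option):

    // Illustrative only: an in-memory stand-in for a conflating consumer.
    function processMostRecent(queue, handle) {
      var newest = null;
      // Drain everything currently waiting; older readings are read and dropped.
      while (queue.length > 0) {
        newest = queue.shift();
      }
      // Only the most recent reading drives a decision.
      if (newest !== null) {
        handle(newest);
      }
    }

    // Example: only the latest temperature matters.
    var readings = [{ temp: 21.0 }, { temp: 21.4 }, { temp: 22.1 }];
    processMostRecent(readings, function (r) {
      console.log('acting on temperature ' + r.temp);
    });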
This answers the original question about the assumed main point of handling messages.
Efficiency first:
In any case where smart delivery takes place (either deliver an exact copy of the original message content or nothing at all), the resources are used at their best, yet without spending a single penny on anything but the "just enough" smart delivery.
Building robustness costs more than that.
Building ultimate robustness costs even more than that.
Systems that do have such an extreme requirement can and may extend the resources-efficient smart delivery so as to reach some requirements-defined level of robustness, at some add-on cost.
The same in reverse is not possible: an "everything-proof" system cannot simply be slimmed down so as to fit onto restricted-resources hardware, nor made to "forget" some "old" messages that are of no positive value at this very moment; on the contrary, such a system makes it a must for the processing element to read and process each and every "unwanted" message, just because it was delivered, even when the core logic needs only the most recent one.
Distributed systems accrue end-to-end latency from many distributed sources, so any rigid-delivery system just blocks and penalises the only element that is (latency-wise) innocent: the receiver.
I suppose it's OK to lose a few messages from some measurement units that deliver the value once in.... Also, for big data analytics solutions a few lost messages won't make a big difference
It all depends on the application/larger system. The message queue is only one link in the chain, so to speak. If the application(s) at the ends are prepared to deal with loss, losing some messages is not a problem. If the application(s) rely on total messaging integrity then there will be problems.
An example of a system that will be ok with loss is weather updates for your phone. If a few temperature/wind updates don't make it to you there's no real harm in that.
Now, if you're running a nuclear reactor and you lose a few temperature updates on the core, well that is a problem.
I work a lot on safety critical, infrastructure-level systems, and am responsible for messaging much of the time. Many of those systems state clearly that messaging may reorder, duplicate, or lose messages; it's just a fact of life where distributed systems and networks are involved. The endpoint systems need to be designed to work correctly in that environment. So they track messages, ack end to end, deal with duplicates and retransmits, etc.

catching Exceeded maximum execution time?

Is there any way of catching this GAS error: "Exceeded maximum execution time"?
I mean catching it with try ... catch(e); so far it's not working for me.
Thanks
As written in the comments to your question, that's not possible. However, you can set a flag in scriptDB or properties when execution starts and clear that flag when execution comes to a normal end. That way you can find out during the next run whether your script came to a regular end the last time it was run, and try to take corrective action if it did not.
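A minimal sketch of that flag idea using PropertiesService (the property name is arbitrary, and handleAbortedPreviousRun and doTheActualWork are placeholders):

    function main() {
      var props = PropertiesService.getScriptProperties();
      // If the flag from the previous run was never cleared, that run was killed
      // (e.g. by "Exceeded maximum execution time") before reaching a normal end.
      if (props.getProperty('RUN_IN_PROGRESS') === 'true') {
        handleAbortedPreviousRun();   // placeholder: corrective action
      }
      props.setProperty('RUN_IN_PROGRESS', 'true');

      doTheActualWork();              // placeholder: the long-running work

      // Only reached when the run ends normally.
      props.deleteProperty('RUN_IN_PROGRESS');
    }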
The answer above is correct; it's not possible. An easy alternative to the workaround that pbhd mentioned would be to simply track the runtime of your script (e.g. comparing results of new Date().getTime() at regular intervals) and run whatever you'd include under your catch statement right before you hit the maximum execution time. The maximum is 6 minutes (reference).
That way, you don't have to catch the error -- you can preempt it.
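A minimal sketch of that preemptive check (the 5-minute budget, getWorkItems, processItem and saveStateAndScheduleContinuation are placeholders; a time-based trigger created with ScriptApp is one way to continue later):

    var MAX_RUNTIME_MS = 5 * 60 * 1000;   // stop well before the 6-minute limit

    function longRunningJob() {
      var start = new Date().getTime();
      var items = getWorkItems();          // placeholder: the work to be done
      for (var i = 0; i < items.length; i++) {
        if (new Date().getTime() - start > MAX_RUNTIME_MS) {
          // Do here whatever would have gone in the catch block, e.g. persist
          // progress and schedule a continuation trigger, then stop cleanly.
          saveStateAndScheduleContinuation(i);   // placeholder helper
          return;
        }
        processItem(items[i]);               // placeholder: per-item work
      }
    }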
During normal testing, it is possible to accidentally create an infinite (or very long-running) loop that consumes 100% of the daily execution time limit.
Even if you realize what you have done wrong immediately, you cannot immediately retry with Google scripts for another 24 hours, which slows down ongoing development significantly and may force the developer to do some other work, taking his focus/"attention stream" away from the current problem. This is almost always bad.
My product ("IBM OLIVER CICS test/debug" - see the Wikipedia article) solved this problem - and many others - around 37 years ago, by putting a time limit on any particular transaction and intercepting the resulting timeout, allowing the options of:
continuation or
examine/modify variables
"manual" re-try (for the same time) or
abort.
Google could implement this approach just as easily - by "pausing" if execution time is looking too heavy. I had a similar solution to other resources in OLIVER - such as excessive API calls ("possible macro loop") and excessive memory usage.
It seems it takes an "old timer" like me to provide solutions to problems that have existed "since the beginning of time" (and certainly before PC's were thought of).
Google's current "solution" (i.e. absolute limits) only helps Google keep its own servers from being swamped. It would be easy for them to do what OLIVER did all those years ago. By the way, there should be no "IBM" prefix on the Wikipedia article - it was my own product and some clown of a Wikipedia editor altered it to include the prefix.
(By the way, Google does not prevent other scripts on the same spreadsheet from running - ones that maybe only use minimal amounts of extra time - i.e. scripts on the same spreadsheet still work. I tried renaming the original script as an experiment, but it was stopped after a very short time with the "exceeded execution time" error.)
GIZ-A-JOB Google - you know it's worth it!