Google Drive SDK - 500: Internal Server error: File uploads successfully most of the time - google-drive-api

The Google Drive REST API sometimes returns a 500: Internal Server Error when attempting to upload a file. Most of these errors actually correspond to an upload that succeeded. We retry the upload as per Google's recommendations, only to see duplicates later on.
What is the recommended way of handling these errors?

Google's documentation seems to indicate that this is an internal error on their side, not a specific error that you can fix. They suggest using exponential backoff, which basically means re-attempting the call at increasing intervals.
For example: the call fails, so wait 2 seconds and try again. If that fails, wait 4 seconds, then 8 seconds, 16, 32 and so on. The bigger gaps give the service more and more time to right itself. Depending on your needs you may also want to cap the total time, so that it stops retrying after, say, 10 minutes.
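For illustration, here is a minimal hand-rolled version of that loop in Python. The do_upload function is a hypothetical placeholder for whatever Drive call you are retrying, and in practice you would restrict the except clause to retryable (5xx) errors instead of catching everything:

import time

def upload_with_backoff(do_upload, max_total_wait=10*60):
    wait = 2          # first pause is 2 seconds
    waited = 0        # total time spent waiting so far
    while True:
        try:
            return do_upload()
        except Exception:       # sketch only; limit this to retryable errors
            if waited >= max_total_wait:
                raise           # give up after roughly 10 minutes of waiting
            time.sleep(wait)
            waited += wait
            wait *= 2           # 2, 4, 8, 16, 32, ... seconds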
Rather than writing that loop yourself, the retrying package has a very good setup for this. You can just do from retrying import retry and then use retry as a decorator on any function that should be re-attempted. Here's an example of mine:
from retrying import retry

@retry(wait_exponential_multiplier=1000, wait_exponential_max=60*1000, stop_max_delay=10*60*1000)
def find_file(name, parent=''):
    ...
To use the decorator you just need to put @retry before the function declaration. You could use plain retry(), but there are optional parameters you can pass to adjust how the timing works. I use wait_exponential_multiplier to control how quickly the waiting time grows between tries, wait_exponential_max is the maximum time it will wait between attempts, and stop_max_delay is the total time it will spend retrying before it re-raises the exception. All of these values are in milliseconds.

Standard error handling is described here: https://developers.google.com/drive/handle-errors
However, 500 errors should never happen, so please include log information and Google can look into debugging this issue for you. Thanks.

Related

How can I tell what limit my cloud function hits when I get a generic "Error: could not handle the request" message?

I have deployed a cloud function that runs perfectly fine most of the time, but sometimes outputs this generic error.
The function generates PDF documents: sometimes it renders them from HTML with Puppeteer (that part pretty much always works), and sometimes it combines other PDFs, produced by invoking itself on other URLs, into multi-page documents. I can very well imagine that it hits some kind of limit when those documents get long and complex, but I have set both the memory limit and the execution time limit as high as the service allows, and it still fails. Looking at the monitoring graphs, neither the execution time nor the memory usage seems to be hitting the limits. So the question is: how can I figure out why it fails?
I moved the function to run on the "2nd generation" of the Cloud Functions feature, where it was possible to grant it more memory. This fixed the issue and it now runs reliably. I still do not quite understand why the graphs under "monitoring" did not hit the indicated memory limits when it failed; that would have made the problem immediately obvious. Anyway, 2nd generation FTW.
I post this as an answer just to let anyone struggling with this problem know that the memory usage graphs may not tell the full story, and your cloud function may require more memory even if it does not seem to hit the roof.

What are the rate limits that throw the "Rate Limit Exceeded" error during an upload?

Case
I have this server-to-server file upload implementation that uses Google Drive on the other end. All of a sudden I've been seeing this intermittent error called Rate Limit Exceeded during scheduled file uploads.
Refactor and test
I know the error can be handled by batching the uploads and/or by doing exponential backoff, based on the official documentation. My concern is the actual rate limits, so I ran a test.
I restructured the code so that it uploads only 1 file every 3 minutes.
It didn't work! I'm still getting the same errors, and they still happen intermittently.
Questions
Are there official figures as to maximum rate limits? How many requests per hour? Something like size-to-period ratio or number-of-requests-to-period ratio would really help.
What are the actual rate limits that throw/trigger the "Rate Limit Exceeded" error during a file upload?
You can check your current traffic usage from https://console.developers.google.com.
Some operations like create and update operations have additional internal limits that may be lower than your allowed QPS.
Depending on your use case, there are more specific things you can do (e.g. slow down on per-user operations, but compensate by doing more users in parallel to maximize throughput).
Also, the "403 Rate Limit Exceeded" errors that you are seeing may be occurring due to the number of read/write requests that you are performing per user per sec in your application. Please consider trying the following steps to reduce the errors:
You may need to optimize your code by reducing the number of api calls made simultaneously per user/sec.
Batch the requests.
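As a rough sketch of what backing off on these 403s can look like with the Google API Python client for Drive v3: the service object and the file_metadata/media values are assumed to already exist, and the set of status codes retried here is a simplification (in production you would also inspect the error reason so you don't retry genuine permission errors).

import random
import time
from googleapiclient.errors import HttpError

def create_with_backoff(service, file_metadata, media, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            # Drive v3 file upload
            return service.files().create(body=file_metadata, media_body=media).execute()
        except HttpError as err:
            # Only retry rate-limit and server-side errors
            if err.resp.status not in (403, 429, 500, 503):
                raise
            # Exponential backoff with a little jitter: ~1s, 2s, 4s, 8s, ...
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError('upload still failing after %d attempts' % max_attempts)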

Server side microsecond timing

Is there some API available for microsecond-accurate timing? Some jitter is acceptable; something equivalent to performance.now() is preferable.
In my 5 minutes of research I found the console object, which does log times accurately enough, but there is no easy way to retrieve those logged entries. Additionally, I may call this timing function thousands of times, which would clutter the logs.

What does the timeout in nerve signify?

I'm trying to discover a few services using nerve. While doing so, I came across the timeout configuration specified in the nerve docs.
timeout: (optional) maximum time the check can take; defaults to 100ms
However, when I look at the examples provided, the timeout is given as "0.2".
Does this mean the timeout for these examples is "0.2 ms"? Is that even a valid configuration for a timeout?
Or is 0.2 considered to be 2 seconds?
I went through the code for nerve, and it looks like the timeout value provided in the nerve JSON is simply read and passed directly to the HTTP client as read_timeout, without any additional processing.
As per the Ruby documentation, this value is in seconds.
So 0.2 means 200 ms.
I'm assuming the nerve docs were not updated or contain a mistake here.
read_timeout[R]
Number of seconds to wait for one block to be read (via one read(2) call). Any number may be used, including Floats for fractional seconds. If the HTTP object cannot read data in this many seconds, it raises a Net::ReadTimeout exception. The default value is 60 seconds.

DynamoDB: handling throttling with boto

According to the DynamoDB docs, requests causing database throttling are automatically retried when using the supported SDKs. However, I was unable to find any mention of how boto handles throttling. Does boto automatically retry throttled requests, or should I start catching ProvisionedThroughputExceededException?
Boto does automatically retry ProvisionedThroughputExceededException errors. There is a special retry handler in the boto.dynamodb.layer1 module that handles this. It uses shorter wait intervals and retries a maximum of 10 times. After that, it throws a DynamoDBThroughputExceededError exception. The boto library also keeps track of the total number of ThroughputExceededErrors that are caught in the attribute throughput_exceeded_events of the Layer1 object.
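If you do want to observe or react to the throttling yourself, something like the following sketch works with the legacy boto 2.x interface; the table name and key are made up, and exact module paths may differ slightly between boto versions.

import boto.dynamodb
from boto.dynamodb.exceptions import DynamoDBThroughputExceededError

conn = boto.dynamodb.connect_to_region('us-east-1')   # Layer2 wrapping a Layer1 connection
table = conn.get_table('my-table')                    # hypothetical table name

try:
    item = table.get_item(hash_key='some-key')
except DynamoDBThroughputExceededError:
    # boto has already retried up to 10 times internally before raising this
    print("still throttled after boto's built-in retries")

# Running count of throttling events boto absorbed, kept on the Layer1 object
print(conn.layer1.throughput_exceeded_events)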