What are the rate limits that throw the "Rate Limit Exceeded" error during an upload? - google-drive-api

Case
I have a server-to-server file upload implementation that uses Google Drive on the other end. All of a sudden I've been seeing an intermittent "Rate Limit Exceeded" error during scheduled file uploads.
Refactor and test
I know the error can be handled by batching the uploads and/or by using exponential backoff, based on the official documentation. My concern is the actual rate limits, so I ran a test.
I restructured the code to upload only one file every 3 minutes.
It didn't work! I'm still getting the same errors, and they still occur intermittently.
Questions
Are there official figures for the maximum rate limits? How many requests per hour? Something like a size-to-period or requests-to-period ratio would really help.
What are the actual rate limits that throw/trigger the "Rate Limit Exceeded" error during a file upload?

You can check your current traffic usage at https://console.developers.google.com.
Some operations, such as create and update, have additional internal limits that may be lower than your allowed QPS.
Depending on your use case, there are more specific things you can do (e.g. slow down per-user operations, but compensate by processing more users in parallel to maximize throughput).
Also, the "403 Rate Limit Exceeded" errors that you are seeing may be occurring due to the number of read/write requests that you are performing per user per sec in your application. Please consider trying the following steps to reduce the errors:
You may need to optimize your code by reducing the number of api calls made simultaneously per user/sec.
Batch the requests.
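A minimal sketch of the backoff idea, assuming the google-api-python-client library, the Drive v3 files().create endpoint, and an already-authorized service object; the folder ID, file path, retry cap, and status-code list are placeholders, not from the original post:

    import random
    import time

    from googleapiclient.errors import HttpError
    from googleapiclient.http import MediaFileUpload

    def upload_with_backoff(service, path, folder_id, max_retries=5):
        media = MediaFileUpload(path, resumable=True)
        body = {"name": path, "parents": [folder_id]}
        for attempt in range(max_retries):
            try:
                return service.files().create(
                    body=body, media_body=media, fields="id"
                ).execute()
            except HttpError as err:
                # 403 "Rate Limit Exceeded" (and 429/5xx) are the retryable cases;
                # anything else should surface immediately.
                if err.resp.status in (403, 429, 500, 502, 503):
                    time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s, ... plus jitter
                    continue
                raise
        raise RuntimeError("upload still rate limited after %d retries" % max_retries)

The exact numbers are illustrative; the important part is that each retry waits roughly twice as long as the previous one, with a little random jitter so parallel workers do not all retry at the same instant.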

Related

How can I tell what limit my cloud function hits when I get a generic "Error: could not handle the request" message?

I have deployed a cloud function that runs perfectly fine most of the time, but sometimes outputs this generic error.
The function generates PDF documents: sometimes it generates them from HTML with Puppeteer (I think this part pretty much always works), and sometimes it combines other PDFs into multi-page documents by invoking itself and loading other URLs. I can very well imagine that it hits some kind of limit when those documents get long and complex, but I have set both the memory limit and the execution time limit as high as the service allows, and it still fails. Looking at the monitoring graphs, neither the execution time nor the memory usage appears to be hitting its limit. So the question is: how can I figure out why it fails?
I moved the function to the "2nd generation" Cloud Functions feature, where it was possible to grant it more memory. This fixed the issue and it now runs reliably. I still do not quite understand why the graphs under "monitoring" did not hit the indicated memory limit when it failed - that would have made the problem immediately obvious. Anyway, 2nd generation FTW.
I post this as an answer just to let anyone struggling with this problem know that the memory usage graphs may not tell the full story, and your cloud function may require more memory even if it does not seem to be hitting the ceiling.

EsRejectedExecutionException in elasticsearch for parallel search

I am sending multiple parallel search requests to Elasticsearch using a single transport client instance in my application.
I get the exception below during the parallel execution. How can I overcome this issue?
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23#5f804c60
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteScan(SearchServiceTransportAction.java:441)
at org.elasticsearch.action.search.type.TransportSearchScanAction$AsyncAction.sendExecuteFirstPhase(TransportSearchScanAction.java:68)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:171)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:153)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:52)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:42)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:107)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:124)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:113)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:212)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:109)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Elasticsearch has a thread pool and a queue for search per node.
A thread pool has N workers ready to handle requests. When a request comes in and a worker is free, it is handled by that worker. By default the number of workers is equal to the number of cores on that CPU.
When all the workers are busy and more search requests arrive, the requests go into a queue. The size of the queue is also limited. If the queue size is, say, 1000 (as in your error) and more parallel requests than that arrive, the extra requests are rejected, as you can see in the error log.
Solutions:
The immediate solution is to increase the size of the search queue. You can also increase the size of the thread pool, but that might badly affect the performance of individual queries, so increasing the queue is usually the better idea. Remember, though, that this queue is resident in memory, and increasing the queue size too much can result in out-of-memory issues. (more info)
Increase the number of nodes and replicas. Remember that each node has its own search thread pool and queue, and a search can be served by either a primary shard or a replica.
It may sound strange, but you need to lower the number of parallel searches. With that exception, Elasticsearch is telling you that you are overloading it. There are limits (at the thread-count level) set in Elasticsearch and, most of the time, the defaults for these limits are the best option. So if you are testing your cluster to see how much load it can hold, this is an indicator that some limits have been reached.
Alternatively, if you really want to change the defaults, you can try increasing the queue size for searches to accommodate the concurrency demand, but keep in mind that the larger the queue size, the more pressure you put on your cluster, which in the end will cause instability.
I saw this same error because I was sending lots of indexing requests to ES in parallel. Since I was writing a data migration, it was easy enough to make them serial, and that resolved the issue.
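As a rough illustration of "lower the parallel searches" in practice, here is a sketch that bounds client-side concurrency with a fixed-size worker pool. It is written against the older Python elasticsearch client purely for brevity (the question itself uses the Java TransportClient, where a fixed-size executor achieves the same thing); the endpoint, index name, and pool size are made-up values:

    from concurrent.futures import ThreadPoolExecutor

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])   # assumed endpoint
    MAX_IN_FLIGHT = 20                              # keep this well below the node's search queue size

    def run_queries(queries):
        # The pool size is the throttle: at most MAX_IN_FLIGHT searches are ever
        # outstanding, and the rest wait in the client instead of piling up in
        # Elasticsearch's per-node search queue.
        with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
            futures = [pool.submit(es.search, index="my-index", body=q) for q in queries]
            return [f.result() for f in futures]

Whether you throttle on the client or enlarge the queue, the goal is the same: keep the number of concurrent searches within what the cluster's search thread pools can actually absorb.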
I don't know what your node configuration is, but your queue size (1000) is already on the higher side. As others have already explained, your search requests are queued in the Elasticsearch thread pool queue. If you are still getting rejections with such a large queue, that is a hint that you need to revisit your query pattern.
As with many other designs, there is no one-size-fits-all solution here. I found a very good post about how this queue works and the different ways to run performance tests to find out what suits your use case best.
HTH!

What are the specific quotas for Properties Service in Google Apps for an Add-On?

We have an Add-On that uses DocumentProperties and UserProperties for persistence. We've recently had some users complain about receiving the error: "Service invoked too many times for one day: properties". I spent some time optimizing usage of the Properties Service when developing the Add-On to ensure this doesn't happen, but it still seems to happen in some situations.
According to https://developers.google.com/apps-script/guides/services/quotas, there is a quota of 50,000 "set" calls per day per user or document. There does not appear to be a limit on the number of reads.
Here's how our app utilizes properties:
We only set properties when values change. Even with very heavy usage, I still can't imagine more than 500 property "set" calls per day per user. When we set properties, we also write to the Cache Service with a 6-hour timeout.
We read the properties when the user is using the add-on, and also every 10 seconds while the add-on is idling. That comes out to 8,640 reads per document per day. However, we use the Cache Service for reads, so very few of those reads should hit the Properties Service. I did discover a bug, present when the most recent report came in, where we don't re-write to the cache after it expires until an explicit change is made. By my calculations, that leads to 6 hours with 1 read, and then 18 hours of 6 reads/min * 60 min/hr, or 6,480 reads. Factor in some very heavy usage, and we're still at about 7,000 reads per day per document. The user claims he had two copies of the document open, so 14,000 reads. That's still far below the 50,000 quota, not to mention that the quota seems to apply to setting properties, not reading them.
Would anyone from Google be able to offer some insight into any limits on Properties reads, or give any advice on how to avoid this situation going forward?
Thanks much!
This 50,000-writes-per-day quota is shared across all of a user's scripts. This means that if a particular script makes heavy use of writes, it can negatively impact other scripts installed by the same user.

Handling lots of req / sec in go or nodejs

I'm developing a web app that needs to handle bursts of very high load. Once per minute I get a burst of requests within a very few seconds (~1M-3M/sec), and then for the rest of the minute I get nothing.
What is my best strategy to handle as many requests per second as possible at each front server? Just send a reply and store the request in memory somehow, to be processed in the background by a DB writer worker later?
The aim is to do as little as possible during the burst, and to write the requests to the DB as soon as possible after the burst.
Edit: the order of transactions is not important,
we can lose some transactions, but 99% need to be recorded,
and the latency of getting all requests to the DB can be a few seconds after the last request has been received. Let's say no more than 15 seconds.
This question is kind of vague. But I'll take a stab at it.
1) You need limits. A simple implementation will open millions of connections to the DB, which will obviously perform badly. At the very least, each connection eats megabytes of RAM on the DB. Even with connection pooling, each 'thread' could take a lot of RAM to record its (incoming) state.
If your app server has a limited number of processing threads, you can use HAProxy to "pick up the phone" and buffer the request in a queue for a few seconds until there is a free thread on your app server to handle it.
In fact, you could just use a web server like nginx to take the request and say "200 OK". Then later, a simple app reads the web log and inserts into the DB. This will scale pretty well, although you probably want one thread reading the log and several threads inserting.
2) If your language has coroutines, it may be better to handle the buffering yourself. You should measure the overhead of relying on your language runtime for scheduling.
For example, if each HTTP request is 1K of headers + data, you want to parse it and throw away everything but the one or two pieces of data that you actually need (e.g. the DB ID). If you rely on your language's coroutines as an 'implicit' queue, each coroutine will hold a 1K buffer while its request is being parsed. In some cases, it's more efficient/faster to have a finite number of workers and manage the queue explicitly. When you have a million things to do, small overheads add up quickly, and the language runtime won't always be optimized for your app.
Also, Go will give you far better control over your memory than Node.js. (Structs are much smaller than objects. The 'overhead' for the Keys to your struct is a compile-time thing for Go, but a run-time thing for Node.js)
3) How do you know it's working? You want to be able to know exactly how you are doing. When you rely on the language co-routines, it's not easy to ask "how many threads of execution do I have and what's the oldest one?" If you make an explicit queue, those questions are much easier to ask. (Imagine a handful of workers putting stuff in the queue, and a handful of workers pulling stuff out. There is a little uncertainty around the edges, but the queue in the middle very explicitly captures your backlog. You can easily calculate things like "drain rate" and "max memory usage" which are very important to knowing how overloaded you are.)
My advice: Go with Go. Long term, Go will be a much better choice. The Go runtime is a bit immature right now, but every release is getting better. Node.js is probably slightly ahead in a few areas (maturity, size of community, libraries, etc.)
How about a channel with a buffer size equal to what the DB writer can handle in 15 seconds? When the request comes in, it is sent on the channel. If the channel is full, give some sort of "System Overloaded" error response.
Then the DB writer reads from the channel and writes to the database.
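Sketching that bounded-buffer idea (the answer above describes it with a Go buffered channel; this Python version is only for illustration, and the capacity, handler names, and write_to_db stub are all made up):

    import queue
    import threading

    QUEUE_CAPACITY = 50000            # placeholder: roughly what the DB writer can drain in ~15s
    requests_q = queue.Queue(maxsize=QUEUE_CAPACITY)

    def write_to_db(item):
        pass                          # placeholder for the real (ideally batched) DB insert

    def handle_request(payload):
        # Called by the front end for each incoming request.
        try:
            requests_q.put_nowait(payload)   # non-blocking, like a default-case channel send in Go
            return 200, "OK"
        except queue.Full:
            return 503, "System Overloaded"  # shed load instead of falling over

    def db_writer():
        # Background worker that drains the queue into the database.
        while True:
            item = requests_q.get()
            write_to_db(item)
            requests_q.task_done()

    threading.Thread(target=db_writer, daemon=True).start()

The buffer capacity is the knob: it defines exactly how far the writer may fall behind before the front end starts rejecting work, which matches the 15-second latency budget from the question.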

Google Drive SDK - 500: Internal Server error: File uploads successfully most of the time

The Google Drive REST API sometimes returns a 500: Internal Server Error when attempting to upload a file. Most of these errors actually correspond to a successful upload. We retry the upload as per Google's recommendations only to see duplicates later on.
What is the recommended way of handling these errors?
Google's documentation seems to indicate that this is an internal error of theirs, and not a specific error that you can fix. They suggest using exponential backoff, which is basically re-attempting the function at increasing intervals.
For example: the function fails. Wait 2 seconds and try again. If that fails, wait 4 seconds, then 8, 16, 32, and so on. The bigger gaps give the service more and more time to right itself, though depending on your needs you may want to cap the total so that it waits a maximum of 10 minutes before stopping.
The retrying package has a very good setup for this. You can just from retrying import retry and then use retry as a decorator on any function that should be re-attempted. Here's an example of mine:
from retrying import retry

@retry(wait_exponential_multiplier=1000, wait_exponential_max=60*1000, stop_max_delay=10*60*1000)
def find_file(name, parent=''):
    ...
To use the decorator you just need to put @retry before the function declaration. You could use retry() with no arguments, but there are optional parameters you can pass to adjust how the timing works. I use wait_exponential_multiplier to adjust the increase in waiting time between tries. wait_exponential_max is the maximum time it can spend waiting between attempts. And stop_max_delay is the total time it will spend retrying before it raises the exception. All of their values are in milliseconds.
Standard error handling is described here: https://developers.google.com/drive/handle-errors
However, 500 errors should never happen, so please add logging information and Google can look into debugging this issue for you. Thanks.