Google Cloud Functions timeout limit changes - google-cloud-functions

I have a Google Cloud function that had been working for the last few weeks. I removed it and tried to redeploy it, and now I get this error:
INVALID_ARGUMENT: The timeout for functions with an event trigger cannot exceed 540 seconds.
This was the command used to deploy and update (and it was working until today):
gcloud functions deploy import-XXXXXXX-function \
--gen2 \
--runtime=go119 \
--memory=128Mi \
--timeout=30m \
--region=$REGION \
--source="$ROOT" \
--entry-point=ImportXXXXXXX \
--trigger-event-filters="type=google.cloud.storage.object.v1.finalized" \
--trigger-event-filters="bucket=$BUCKET" \
--set-env-vars=STAGE=$STAGE
I can see that the documentation was updated a few days ago ("Last updated 2023-02-02 UTC.") and now the max timeout for event-driven functions is 540 seconds.
So, two questions:
My job processes files that sometimes take around 15 minutes. What should I do now?
How can I verify whether the timeout limit was the most recent change to the doc?

The timeout has always been 540 seconds. See (and note the dates):
Google cloud function max timeout 540 seconds not working for client
Run Firebase Cloud Function for more than 9 mins (540 sec)
If you need more time for a background/async execution, you can either break up your function into different parts, or you can use a different product, such as Cloud Run, which allows request timeouts of up to 60 minutes. There is a full example of how to do that.
You could also arrange to use App Engine or Compute Engine to stand up a server.
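For instance, here is a minimal sketch of moving the work to Cloud Run instead (the service name, image and service account are placeholders, and the Eventarc trigger mirrors the event filters from the deploy command above):
# Deploy the worker as a Cloud Run service with a 15-minute request timeout
gcloud run deploy import-worker \
--image="gcr.io/$PROJECT/import-worker" \
--region=$REGION \
--timeout=900
# Route the same Cloud Storage "finalized" events to it via Eventarc
gcloud eventarc triggers create import-worker-trigger \
--location=$REGION \
--destination-run-service=import-worker \
--destination-run-region=$REGION \
--event-filters="type=google.cloud.storage.object.v1.finalized" \
--event-filters="bucket=$BUCKET" \
--service-account=$SERVICE_ACCOUNT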

Related

Google Earth Engine Python API Warning

Yesterday, I created a script to get Sentinel-1 images for a specific region, and the code ran as expected. Today, running the same code with no changes, I get the following warning:
WARNING:googleapiclient.http:Sleeping 0.91 seconds before retry 1 of 5 for request: POST https://earthengine.googleapis.com/v1alpha/projects/earthengine-legacy/value:compute?prettyPrint=false&alt=json, after 500
I am wondering: is this a quota issue, or some other issue?

Cloud Build fails due to Cloud Function container not being ready yet

We have a Google Cloud Build pipeline that does the following:
Create a temp function
Run some tests
Deploy the production function
However, the test step often fails because the container from the first step is not ready yet:
Step #3: Unexpected token '<' at 2:1
Step #3: <html><head>
Step #3: site unreachable
It looks like it is returning some placeholder HTML from nginx.
How can we fix that?
Currently, we just put an ugly sleep between the steps.
You probably want to have a look at the Cloud Functions API: it has an operations endpoint that tells you whether an operation has finished or not (assuming v1, otherwise look below): https://cloud.google.com/functions/docs/reference/rest/v1/operations/get
The operation ID is the same ID returned by the creation call. You can also list operations with the list endpoint (in the same doc).
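Alternatively, a rough workaround inside the pipeline itself is to poll the function with gcloud until it reports ACTIVE before running the tests (a sketch assuming a v1 function; the name and region are placeholders):
# Block until the freshly deployed function reports ACTIVE
until [ "$(gcloud functions describe temp-function --region=$REGION --format='value(status)')" = "ACTIVE" ]; do
  echo "Function not ready yet, retrying..."
  sleep 5
done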

Save a model weights when a program receives TIME LIMIT while learning on a SLURM cluster

I use deep learning models written in pytorch_lightning (PyTorch) and train them on SLURM clusters. I submit jobs like this:
sbatch --gpus=1 -t 100 python train.py
When the requested GPU time ends, SLURM kills my program and shows this message:
Epoch 0: : 339it [01:10, 4.84it/s, loss=-34] slurmstepd: error: *** JOB 375083 ON cn-007 CANCELLED AT 2021-10-04T22:20:54 DUE TO TIME LIMIT ***
How can I configure the Trainer to save the model when the available time ends?
I know about automatic saving after each epoch, but I have only one long epoch that lasts more than 10 hours, so that approach does not work for me.
You can use Slurm's signalling mechanism to pass a signal to your application when it is within a certain number of seconds of the time limit (see man sbatch). In your submission script use --signal=USR1@30 to send USR1 30 seconds before the time limit is reached. Your submit script would contain these lines:
#SBATCH -t 100
#SBATCH --signal=USR1@30
srun python train.py
Then, in your code, you can handle that signal like this:
import signal

def handler(signum, frame):
    print('Signal handler got signal', signum)
    # e.g. exit(0), or call your pytorch save routines here

# enable the handler
signal.signal(signal.SIGUSR1, handler)

# your code here
You need to call your Python application via srun in order for Slurm to be able to propagate the signal to the Python process. (You can probably use --signal on the command line to sbatch; I tend to prefer writing self-contained submit scripts. :))
Edit: This link has a nice summary of the issues involved with signal propagation and Slurm.

ESPv2 with Google Cloud Functions upstream request timeout

I'm having an issue getting answers from a Google Cloud Function going through ESPv2.
Every time I request it, I get a response 15 seconds later with a status code of 504.
My function takes between 30 and 45 seconds.
In the logs, the function runs correctly and answers back after 35 seconds.
Is there a way to increase the timeout in ESPv2?
Thanks
For anyone else having this issue: in openapi-functions.yaml, under x-google-backend, you should add the attribute deadline and set it to whatever value in seconds you want (see the snippet below).
Here is the somewhat hidden documentation: https://cloud.google.com/endpoints/docs/openapi/openapi-extensions#deadline
Related issue: https://github.com/GoogleCloudPlatform/esp-v2/issues/4
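As a rough illustration, the relevant fragment of openapi-functions.yaml might look like this (the path, operationId and backend address are placeholders; deadline is in seconds):
paths:
  /import:
    get:
      operationId: importData
      x-google-backend:
        # point ESPv2 at the function and allow up to 60 s for a response
        address: https://REGION-PROJECT_ID.cloudfunctions.net/myFunction
        deadline: 60.0
      responses:
        '200':
          description: OK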
Depending on the documentation you used to secure the endpoints of your Cloud Function with ESPv2, this should be possible. If you are using Cloud Run to host your ESPv2, a 504 error is sent when a request exceeds the request timeout limit. The request timeout limit is a setting that specifies the time within which a response must be returned before a 504 response is sent. You can change this value by going to your "Cloud Run" tab, selecting your ESPv2 service, selecting "Edit & Deploy new Revision", scrolling down to the capacity section and setting the request timeout in seconds. This is some documentation that could prove helpful when working with the topics discussed.
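The same change can also be made from the command line (the ESPv2 service name is a placeholder):
# Raise the Cloud Run request timeout for the ESPv2 gateway to 5 minutes
gcloud run services update espv2-gateway --region=$REGION --timeout=300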

Google Cloud SQL No Response

We are running a Sails.js API on Google Container Engine with a Cloud SQL database and recently we've been finding some of our endpoints have been stalling, never sending a response.
I had a health check monitoring /v1/status and it registered 100% uptime when I had the following simple response:
status: function( req, res ){
  res.ok('Welcome to the API');
}
As soon as we added a database query, the endpoint started timing out. It doesn't happen all the time, but seemingly at random intervals, sometimes for hours on end. This is what we changed the query to:
status: function( req, res ){
  Email.findOne({ value: "someone@example.com" }).then(function( email ){
    res.ok('Welcome to the API');
  }).fail(function( err ){
    res.serverError(err);
  });
}
Rather suspiciously, this all works fine in our staging and development environments, it's only when the code is deployed in production that the timeout occurs and it only occurs some of the time. The only thing that changes between staging and production is the database we are connecting to and the load on the server.
As I mentioned earlier, we are using Google Cloud SQL and the Sails-MySQL adapter. We have the following error stacks from the production server:
AdapterError: Invalid connection name specified
at getConnectionObject (/app/node_modules/sails-mysql/lib/adapter.js:1182:35)
at spawnConnection (/app/node_modules/sails-mysql/lib/adapter.js:1097:7)
at Object.module.exports.adapter.find (/app/node_modules/sails-mysql/lib/adapter.js:801:16)
at module.exports.find (/app/node_modules/sails/node_modules/waterline/lib/waterline/adapter/dql.js:120:13)
at module.exports.findOne (/app/node_modules/sails/node_modules/waterline/lib/waterline/adapter/dql.js:163:10)
at _runOperation (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/operations.js:408:29)
at run (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/operations.js:69:8)
at bound.module.exports.findOne (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/basic.js:78:16)
at bound [as findOne] (/app/node_modules/sails/node_modules/lodash/dist/lodash.js:729:21)
at Deferred.exec (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:501:16)
at tryCatcher (/app/node_modules/sails/node_modules/waterline/node_modules/bluebird/js/main/util.js:26:23)
at ret (eval at <anonymous> (/app/node_modules/sails/node_modules/waterline/node_modules/bluebird/js/main/promisify.js:163:12), <anonymous>:13:39)
at Deferred.toPromise (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:510:61)
at Deferred.then (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:521:15)
at Strategy._verify (/app/api/services/passport.js:31:7)
at Strategy.authenticate (/app/node_modules/passport-local/lib/strategy.js:90:12)
at attempt (/app/node_modules/passport/lib/middleware/authenticate.js:341:16)
at authenticate (/app/node_modules/passport/lib/middleware/authenticate.js:342:7)
at Object.AuthController.login (/app/api/controllers/AuthController.js:119:5)
at bound (/app/node_modules/sails/node_modules/lodash/dist/lodash.js:729:21)
at routeTargetFnWrapper (/app/node_modules/sails/lib/router/bind.js:179:5)
at callbacks (/app/node_modules/sails/node_modules/express/lib/router/index.js:164:37)
Error (E_UNKNOWN) :: Encountered an unexpected error :
Could not connect to MySQL: Error: Pool is closed.
at afterwards (/app/node_modules/sails-mysql/lib/connections/spawn.js:72:13)
at /app/node_modules/sails-mysql/lib/connections/spawn.js:40:7
at process._tickDomainCallback (node.js:381:11)
Looking at the errors alone, I'd be tempted to say that we have something misconfigured. But the fact that it works some of the time (and has previously been working fine!) leads me to believe that there's some other black magic at work here. Our Cloud SQL instance is D0 (though we've tried upping the size to D4) and our activation policy is "Always On".
EDIT: I had seen others complain about Google Cloud SQL (e.g. this SO post) and I was suspicious, but we have since moved our database to Amazon RDS and we are still seeing the same issues, so it must be a problem with Sails and the MySQL adapter.
This issue is leading to hours of downtime a day, we need it resolved, any help is much appreciated!
This appears to be a Sails issue, and not necessarily related to Cloud SQL.
Is there any chance the QPS limit for Google Cloud SQL is being reached? See here: https://cloud.google.com/sql/faq#sizeqps
Why is my database instance sometimes slow to respond?
In order to minimize the amount you are charged for instances on per use billing plans, by default your instance becomes passive if it is not accessed for 15 minutes. The next time it is accessed there will be a short delay while it is activated. You can change this behavior by configuring the activation policy of the instance. For an example, see Editing an Instance Using the Cloud SDK.
It might be related to your activation policy setting. If it is set to ON_DEMAND, the instance sleeps to save you money, so the first query has to reactivate the instance and is slow. This might cause the timeout.
https://cloud.google.com/sql/faq?hl=en
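If the activation policy does turn out to be the culprit, a one-line sketch of pinning the instance on (the instance name is a placeholder):
# Keep the instance always on so queries never hit a cold start
gcloud sql instances patch my-sql-instance --activation-policy=ALWAYS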