How to optimise calling "sub" Cloud Functions in parallel from an HTTP triggered function - google-cloud-functions

I have thousands of log files in a Cloud Storage bucket that I need to process and aggregate using an HTTP-triggered Cloud Function, and I'm looking for an approach to complete the task as fast as possible using parallelization.
At the moment, I have two cloud functions (nodejs 8):
The "main" function which a user is calling directly passing a list of log files that need to be processed; the function calls the "child" function for each provided log file that I also trigger with an HTTP request run parallel using async.each. The "child" function processes a single log file and returns the data to the "main" function which aggregates the results and, once all files are processed, sends the results back to the user.
If I call a child function directly, it takes about 1 second to complete a single file. I'd hope that if I call the main function to process 100 files in parallel the time will still be more or less 1 second. The first file in a batch is indeed returned after 1 second, but the time increases with every single file and the 100th file is returned after 7 seconds.
The most likely culprit is the fact that I'm running the child function using an HTTP request, but I haven't found a way to call them "internally". Is there another approach specific to Google Cloud Functions or maybe I can somehow optimise the parallelisation of HTTP requests?

The easiest approach is to simply share the code that does whatever the child function does and invoke it directly from the main function. In many cases that is simpler and also costs less, because there are fewer function invocations.
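To illustrate, here is a minimal sketch (Node.js 8) of that approach; it assumes the child's per-file work has been factored into a shared helper, and processLogFile and aggregateResults are hypothetical names:

// Minimal sketch: run the per-file work in-process and in parallel,
// with no HTTP hop per file. processLogFile and aggregateResults are
// hypothetical names for code shared with the former "child" function.
const { processLogFile, aggregateResults } = require('./logProcessing');

exports.main = (req, res) => {
  const files = req.body.files || [];
  Promise.all(files.map((file) => processLogFile(file)))
    .then((results) => res.status(200).json(aggregateResults(results)))
    .catch((err) => {
      console.error(err);
      res.status(500).send(err.message);
    });
};

Note that this keeps all the work on a single instance, so whether it beats the HTTP fan-out approach depends on how CPU-bound the per-file processing is.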
See also: Calling a Cloud Function from another Cloud Function

Related

Execution ID on Google Cloud Run

I am wondering whether Cloud Run has an execution ID like the one in Google Cloud Functions?
An ID that identifies each invocation separately is very useful with "Show matching entries" in Cloud Logging to get all the logs related to one execution.
I understand the execution model is different (Cloud Run allows concurrency), but is there a workaround to attribute each log entry to a specific execution?
My end goal is to group the request and the response on the same line, because right now I print them separately, and if a few requests arrive at the same time I can't tell which response corresponds to which request...
Thank you for your attention!
OpenTelemetry looks like a great solution, but the learning and setup time isn't negligible, so I'm going with a custom ID created in before_request, stored in Flask's g, and included in every print().
import uuid
from flask import g

@app.before_request
def before_request_func():
    # Tag every request handled by this instance with its own ID.
    g.execution_id = uuid.uuid4()

Google Cloud Function: lazy loading not working

I deployed a Google Cloud Function that lazily loads data from Google Datastore. The last update time of my function is 7/25/18, 11:35 PM. It worked well last week.
Normally, if the function is called less than about 30 minutes after the previous call, it does not need to load the data from Google Datastore again. But I found that the lazy loading has not been working since yesterday, even when the time between two calls is less than 1 minute.
Has anyone met the same problem? Thanks!
Cloud Functions can fail for several reasons, such as an uncaught exception or an internal process crash, so you should check the log files / HTTP response error messages to verify the root cause and determine whether the function is being restarted or hitting function execution timeouts, which would explain why your lazy loading is not working.
I suggest you take a look at the Reporting Errors documentation, which explains how to return a function error so you can see the exact error message thrown by the service and return errors in the recommended way. Keep in mind that when errors are returned correctly, the function instance that returned the error is labelled as behaving normally, which avoids the cold starts that lead to higher latency and keeps the instance available to serve future requests if need be.
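For context, this is the lazy-loading pattern the question describes, as a minimal Node.js sketch (loadFromDatastore and the export name are placeholders). The module-level variable only survives while the same instance stays warm, so any crash or restart of the kind described above clears it and forces a reload:

let cachedData = null; // reused across invocations while this instance stays warm

exports.myFunction = (req, res) => {
  if (cachedData) {
    console.log('Warm instance: serving previously loaded data');
    return res.status(200).json(cachedData);
  }
  console.log('Cold or recycled instance: loading from Datastore');
  loadFromDatastore() // placeholder for the actual Datastore query
    .then((data) => {
      cachedData = data;
      res.status(200).json(cachedData);
    })
    .catch((err) => {
      console.error(err);
      res.status(500).send('Failed to load data');
    });
};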

Data Studio connector making multiple calls to API when it should only be making 1

I'm finalizing a Data Studio connector and noticing some odd behavior with the number of API calls.
Where I'm expecting to see a single API call, I'm seeing multiple calls.
In my Apps Script I keep a simple tally that increments by 1 on every URL fetch, and that gives me the number I expect to see per getData() call.
However, in my API monitoring logs (using Runscope) I'm seeing multiple API requests for the same endpoint, and varying numbers for different endpoints within a single getData() call (they should all be the same).
I can't post the code here (client project), but it's substantially the same framework as the Data Connector code in Google's docs. I have caching and backoff implemented.
Looking for any ideas, or has anyone experienced something similar?
Thanks
Per this reference, GDS will also perform semantic type detection if you aren't explicitly defining this property for your fields. If the request is for semantic type detection, it will include sampleExtraction: true.
When Data Studio executes the getData function of a community connector for the purpose of semantic detection, the incoming request will contain a sampleExtraction property which will be set to true.
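One way to spot (and cheaply serve) these extra calls is to check that flag at the top of getData; this is only a sketch, and buildSampleResponse is a hypothetical helper that returns a few static rows:

function getData(request) {
  // Semantic type detection calls arrive with sampleExtraction set to true;
  // answer them with a tiny sample instead of a full API round trip.
  if (request.scriptParams && request.scriptParams.sampleExtraction) {
    Logger.log('getData called for semantic type detection');
    return buildSampleResponse(request); // hypothetical helper
  }
  // ... normal path: fetch from the API (with caching and backoff) and build the rows ...
}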
Also, if the GDS report includes multiple widgets with different dimensions/metrics configurations, then GDS might fire multiple getData calls, one per widget.
Kind of a late answer but this might help others who are facing the same problem.
The widgets / search filters attached to a graph issue getData calls of their own. If your custom connector retrieves data from third-party services via API calls, and that data is agnostic of the request.fields property sent forward by GDS, then those API calls are multiplied by N+1 (where N = the number of widgets / search filters your report implements).
I could not find an official solution for this either, so I invented a workaround using the cache.
The graph's getData request (which typically asks for more fields than the search filters) is the only one allowed to query the API endpoint. Before it starts doing so, it stores a flag in the cache: "cache_{hashOfReportParameters}_building" => true.
if (enableCache) {
  cache.putString('cache_' + hashOfReportParameters + '_building', 'true');
  Logger.log('Cache is being built...');
}
It then retrieves the API responses, paginating in a loop, and buffers the results.
Once it has finished, it clears the "cache_{hashOfReportParameters}_building" flag and caches the final merged results it buffered inside "cache_{hashOfReportParameters}_final".
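Sketched with the same naming as above (allRows stands in for the buffered results; whether you overwrite or remove the flag depends on the cache wrapper you use):

// Publish the merged result and clear the "building" flag so that waiting
// filter/widget requests can proceed.
cache.putString('cache_' + hashOfReportParameters + '_final', JSON.stringify(allRows));
cache.putString('cache_' + hashOfReportParameters + '_building', 'false');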
The search filters also invoke getData, but typically with only up to 3 requested fields. The first thing we want to do is make sure they cannot start executing before the primary getData call, so we add a small delay for requests that look like search filters / widgets going after the same data set:
if (enableCache) {
  var countRequestedFields = requestedFields.asArray().length;
  Logger.log('Total requested fields: ' + countRequestedFields);
  if (countRequestedFields <= 3) {
    Logger.log('This seems to be a search filter.');
    Utilities.sleep(1000);
  }
}
After that we compute a hash over all of the moving parts of the report (the date range, plus any other parameters you have set up that could influence the data retrieved from your API endpoints):
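For illustration, one way to build such a hash in Apps Script; which request properties you include is up to your connector:

function buildReportHash(request) {
  // Serialize everything that can change the underlying data set.
  var parts = JSON.stringify({
    dateRange: request.dateRange,
    configParams: request.configParams
  });
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, parts);
  // Convert the signed byte array into a hex string usable inside a cache key.
  return digest.map(function (b) {
    var v = (b + 256) % 256;
    return (v < 16 ? '0' : '') + v.toString(16);
  }).join('');
}

var hashOfReportParameters = buildReportHash(request);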
Now the best part: as long as the main graph is still building the cache, we make these getData calls wait:
while (cache.getString('cache_' + hashOfReportParameters + '_building') === 'true') {
  Logger.log('A similar request is already executing, please wait...');
  Utilities.sleep(2000);
}
After this loop we attempt to retrieve the contents of "cache_{hashOfReportParameters}_final". In case that fails, it's always a good idea to have a backup plan, which is to let the call traverse the API again. We have encountered roughly a 2% error rate when retrieving data we cached...
With the cached result (or the buffered API responses), you then transform the response into the schema GDS needs (which differs between graphs and filters).
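A sketch of that retrieval step, using the same cache object and key naming as the snippets above; fetchAllPagesFromApi and buildResponse are placeholders for the connector's own fetch and schema-mapping code:

// Read the merged result the primary getData call cached; if it is missing
// or unreadable, fall back to traversing the API again.
var rows = null;
try {
  rows = JSON.parse(cache.getString('cache_' + hashOfReportParameters + '_final'));
} catch (e) {
  rows = null;
}
if (!rows) {
  Logger.log('Cached result unavailable, falling back to the API.');
  rows = fetchAllPagesFromApi(request); // placeholder
}
return buildResponse(rows, requestedFields); // placeholder: map rows to the requested fields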
As you start implementing this, you'll notice yet another problem: the Apps Script cache is limited to a maximum of about 100KB per key. There is, however, no limit on the number of keys you can cache, and fortunately others have encountered similar needs in the past and have come up with a smart solution: splitting the big chunk you need cached into multiple cache keys and gluing them back together into one object when retrieving it.
See: https://github.com/lwbuck01/GASs/blob/b5885e34335d531e00f8d45be4205980d91d976a/EnhancedCacheService/EnhancedCache.gs
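The linked EnhancedCache service handles this for you; just to illustrate the idea, a bare-bones sketch using the plain CacheService API (the function names and the one-hour expiration are placeholders):

// Split a large string across numbered cache entries plus a small index entry,
// keeping each chunk safely under the per-key size limit.
function putLargeString(cache, key, value) {
  var chunkSize = 90 * 1024;
  var count = Math.ceil(value.length / chunkSize);
  cache.put(key + '_count', String(count), 3600);
  for (var i = 0; i < count; i++) {
    cache.put(key + '_' + i, value.substr(i * chunkSize, chunkSize), 3600);
  }
}

function getLargeString(cache, key) {
  var count = Number(cache.get(key + '_count'));
  if (!count) return null; // nothing cached, or the index entry expired
  var parts = [];
  for (var i = 0; i < count; i++) {
    parts.push(cache.get(key + '_' + i));
  }
  // If any chunk is missing, treat the whole value as a cache miss.
  return parts.indexOf(null) === -1 ? parts.join('') : null;
}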
I cannot share the final solution we have implemented with you as it is too specific to a client - but I hope that this will at least give you a good idea on how to approach the problem.
Caching the full API result is generally a good idea to avoid unnecessary round trips and server load, as long as near-realtime data is good enough for your needs.

Can lockservice for GAS work across multiple functions in the same project

I had a problem with a script I wrote; the solution was to use LockService to avoid collisions with form submits. As I had no idea this would be an issue, I've had to go back and revisit old scripts.
I have a script with a few different functions, and it passes data from one function to another. Eventually it writes the data to a sheet, creates a PDF, can email it, and stores the PDF in a folder in Google Drive.
Here's a brief example of what I mean
function firstFunction() {
  // Do stuff, produce something, then pass it along
  secondFunction(something);
}

function secondFunction(something) {
  // Do stuff, produce test, then pass it along
  thirdFunction(test);
}

function thirdFunction(test) {
  // Do stuff, produce that, then pass it along
  fourthFunction(that);
}

function fourthFunction(that) {
  // Finish doing stuff. Write the data to the sheet.
}
I also have a separate script that invokes the first one, iterating through a list of data to bulk-produce PDFs.
I'm worried that if 2 people invoke the script at the same time, I'll have issues again.
Given the example script above, do I have to use LockService in each function, or can I acquire the lock in the first function and release it in the last?
I'm also curious how this sits with the second script that invokes the first one several times. Would adding the lock service in that one be sufficient, or would I also have to add it to the other script as well?
Thanks in advance.
EDIT BELOW
I just remembered I posted the real code on Code Review for advice, and boy did I get some!!
Code Review Post
I should think that you don't need Lock Service in this case at all.
In the Lock Service documentation it states:
[Lock Service] Prevents concurrent access to sections of code. This can be useful when you have multiple users or processes modifying a shared resource and want to prevent collisions. (documentation: https://developers.google.com/apps-script/reference/lock/lock-service#getScriptLock%28%29)
or [Class Lock] is particularly useful for callbacks and triggers, where a user action may cause changes to a shared resource and you want to ensure that there aren't collisions. (documentation: https://developers.google.com/apps-script/reference/lock/lock#tryLock%28Integer%29)
Now, having read the script code that you link to in your edit, I saw no shared resources that the script writes to, so I concluded that no lock is required. (EDIT: On second reading, I see that the script does write to a sheet once; that is the shared resource, so your lock can go within that function only.)
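For completeness, here is a sketch of what that could look like, using the question's fourthFunction as the place where the shared sheet is written; the sheet name and the appendRow call are placeholders:

function fourthFunction(that) {
  // The sheet is the shared resource, so the lock only needs to wrap the write.
  var lock = LockService.getScriptLock();
  lock.waitLock(30000); // wait up to 30 seconds for other executions to finish
  try {
    var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Data');
    sheet.appendRow(that); // placeholder: assumes "that" is a row array
  } finally {
    lock.releaseLock();
  }
}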
I will cross-post this point to the Google Apps Script Google+ community https://plus.google.com/communities/102471985047225101769 since there are experts there who can confirm.

Why is a google.script.run call being repeated?

From the client, I am using google.script.run to call a server function that adds a few hundred objects to the ScriptDB database. However, I have found that the server function is called more than once, so the database ends up with duplicates of these objects.
function serverFunction(bigarray) {
// This function is called multiple times
db.saveBatch(bigarray);
}
Yet I can verify that the code on the client that calls serverFunction is only run once.
function clientFunction() {
alert("This function is only called once.");
google.script.run.serverFunction(bigarray);
}
Could my server code be timing out and getting run again automatically by GAS?
If so, how long is the time out and is this functionality documented anywhere?
Is there any way I can avoid this?
It's currently 30 seconds. This is a known issue and will be fixed fairly soon. (It's not a regression per se, since it's been like this since day 1, but I need to fix it to match the script's own five-minute timeout.)