Memory issues in a Cloud Storage Function - google-cloud-functions

I have deployed a storage-triggered Cloud Function that needs more memory. I deployed the function with what I believe are the appropriate flags:
gcloud functions deploy GCF_name --runtime python37 --trigger-resource bucket_name --trigger-event google.storage.object.finalize --timeout 540s --memory 8192MB
But in the Google Cloud console I can see that the memory utilization graph never goes beyond 2 GB, and in the logs I get this error: Function execution took 34566 ms, finished with status: 'connection error', which happens because of a memory shortage. Can I get some help with this?
Edit:
The application uploads text files to the storage bucket, each containing a certain number of samples. Each file is read when it is uploaded and its data is appended to a pre-existing file. The total number of samples can be at most 75600002, which is why I need 8 GB of memory. The connection error occurs while appending the data to the file.
def write_to_file(filename, data, write_meta=False, metadata=[]):
    file1 = open('/tmp/' + filename, "a+")
    if write_meta:
        file1.write(":".join(metadata))
        file1.write('\n')
    file1.write(",".join(data.astype(str)))
    file1.close()
The memory utilization graph was the same after every upload.

You are writing a file to /tmp, which is an in-memory filesystem, so start by deleting that file when you have finished with it. As the documentation puts it:
Files that you write consume memory available to your function, and sometimes persist between invocations. Failing to explicitly delete these files may eventually lead to an out-of-memory error and a subsequent cold start.
Ref: https://cloud.google.com/functions/docs/bestpractices/tips#always_delete_temporary_files
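For illustration, here is a minimal sketch of cleaning the temporary file up once it has been consumed; the wrapper function and the point at which the file is consumed are assumptions, not part of the original code:
import os

def process_and_cleanup(filename, data):
    tmp_path = '/tmp/' + filename
    write_to_file(filename, data)      # helper from the question
    # ... upload or otherwise consume the file here ...
    # Delete the file so it stops consuming the function's memory.
    if os.path.exists(tmp_path):
        os.remove(tmp_path)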

Related

Unable to get status code of oci-cli command

I need to get the response code to use in scripts. For example, when I run a command like
oci compute instance update --instance-id ocid.of.instance --shape-config '{"OCPU":"2"}' --force
I will get this message:
ServiceError:
{
    "code": "InternalError",
    "message": "Out of host capacity.",
    "opc-request-id": "3FF4337F4ECE43BBB4B8E52524E80247/37CB970D371A9C6BB01DFB23E754FE5B/18DFE9AE75B88A77AB3A1FBEBD3B191B",
    "status": 500
}
In this case I got the error message and a status code of 500. But if the command works, it outputs the full JSON of my instance's parameters, and I can only see a line with response code 200 in debug mode. Is there a way to show only the response code?
Currently the OCI CLI does not provide the HTTP response code directly in its output. The output contains either the service response on success or a service error message on error.
Can you explain how you are using the HTTP response code in your script? Could you not use the command's exit code (non-zero on error) to determine the error case?
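For example, a minimal sketch of branching on the CLI's exit code from a script (written in Python for illustration; the instance OCID is the placeholder from the question):
import json
import subprocess

cmd = [
    "oci", "compute", "instance", "update",
    "--instance-id", "ocid.of.instance",
    "--shape-config", '{"OCPU":"2"}',
    "--force",
]
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    # On success the CLI prints the full instance JSON to stdout.
    instance = json.loads(result.stdout)
    print("update accepted")
else:
    # On failure the exit code is non-zero and the ServiceError message is shown
    # (check both streams, depending on the CLI version).
    print("update failed:", result.stderr.strip() or result.stdout.strip())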
The ERROR: "Out of host capacity" means The selected shape does not have any available servers in the selected region and Availability Domain (AD). Virtual Machines (VM) are dynamically provisioned. If an AD has reached a minimum threshold, new hypervisors (physical servers) will be automatically provisioned.
There may be some occasions where the additional capacity has not finished provisioning before the existing capacity is exhausted, but when retrying in 15 minutes the customer may find the shape they want is available.
Alternatively, selecting a different shape, AD or region will almost certainly have the capacity needed.
Bare metal instances: Host capacity is ordered on a proactive basis guided by the growth rate of a region. Specialized shapes such as DenseIO do not have as much spare overhead and may be more likely to run out of capacity. Customers may need to try another AD or region.

How to read CSV data stored in Google Cloud Storage with Cloud Functions

As part of a communications effort to a large user base, I need to send upwards of 75,000 emails per day. The emails of the users I'm contacting are stored in a CSV file. I've been using Postman Runner to send these requests via SendGrid (Email API), but with such a large volume, my computer either slows way down or Postman completely crashes before the batch completes. Even if it doesn't crash, it takes upwards of 3 hours to send this many POST requests via Runner.
I'd like to upload the CSV containing the emails into a Cloud Storage bucket and then access the file using Cloud Functions to send a POST request for each email. This way, all the processing can be handled by GCP and not by my personal machine. However, I can't seem to get the Cloud Function to read the CSV data line-by-line. I've tried using createReadStream() from the Cloud Storage NodeJS client library along with csv-parser, but can't get this solution to work. Below is what I tried:
const sendGridMail = require('@sendgrid/mail');
const { Storage } = require('@google-cloud/storage');
const fs = require('fs');
const csv = require('csv-parser');

exports.sendMailFromCSV = (file, context) => {
    console.log(` Event: ${context.eventId}`);
    console.log(` Event Type: ${context.eventType}`);
    console.log(` Bucket: ${file.bucket}`);
    console.log(` File: ${file.name}`);
    console.log(` Metageneration: ${file.metageneration}`);
    console.log(` Created: ${file.timeCreated}`);
    console.log(` Updated: ${file.updated}`);

    const storage = new Storage();
    const bucket = storage.bucket(file.bucket);
    const remoteFile = bucket.file(file.name);
    console.log(remoteFile);

    let emails = [];

    fs.createReadStream(remoteFile)
        .pipe(csv())
        .on('data', function (row) {
            console.log(`Email read: ${row.email}`);
            emails.push(row.email);
            // send email using the SendGrid helper library
            const msg = {
                to: [{
                    "email": row.email
                }],
                from: "fakeemail@gmail.com",
                template_id: "fakeTemplate",
            };
            sendGridMail.send(msg).then(() =>
                context.status(200).send(file.body))
                .catch(function (err) {
                    console.log(err);
                    context.status(400).send(file.body);
                });
        })
        .on('end', function () {
            console.table(emails);
        });
};
The Cloud Function is currently triggered by an upload to the Cloud Storage bucket.
Is there a way to build a solution to this problem without loading the file into memory? Is Cloud Functions the right path to be going down, or would it be better to use App Engine or some other tool? I'm willing to try any GCP solution that moves this process to the cloud.
A Cloud Function's memory can be used as a temporary directory, /tmp. Thus, you can download the CSV file from the Cloud Storage bucket into that directory as a local file and then process it as if it were on the local drive, for example as sketched below.
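Here is a minimal sketch of that download step (shown in Python for illustration, since the client libraries are analogous to the Node.js one in the question; the bucket and object names come from the trigger event, and the exact row handling is an assumption):
import csv
from google.cloud import storage

def send_mail_from_csv(event, context):
    # Triggered by google.storage.object.finalize
    client = storage.Client()
    local_path = '/tmp/' + event['name'].replace('/', '_')

    # Download the object into the in-memory /tmp directory...
    blob = client.bucket(event['bucket']).blob(event['name'])
    blob.download_to_filename(local_path)

    # ...then read it like any local file.
    with open(local_path) as f:
        for row in csv.DictReader(f):
            print('Email read:', row['email'])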
At the same time, remember two main restrictions:
Memory: up to 2 GB for everything
Timeout: no more than 540 seconds per invocation
I personally would create a solution based on a combination of a few GCP resources.
The first Cloud Function is triggered by a 'finalize' event, when the CSV file is saved in the bucket. This Cloud Function reads the file and, for every record, composes a Pub/Sub message with the relevant details (enough to send an email). That message is posted to a Pub/Sub topic.
The Pub/Sub topic is used to transfer all messages from the first Cloud Function and to trigger the second Cloud Function.
The second Cloud Function is triggered by a Pub/Sub message, which contains all the necessary details to process and send an email. As there may be 75K records in the source CSV file (for example), you should expect 75K invocations of the second Cloud Function.
That may be enough at a high level. The Pub/Sub paradigm guarantees at-least-once delivery (so a message may be delivered more than once), so if you need no more than one email per address, some additional resources may be required to achieve idempotent behaviour.
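A minimal sketch of the first function's fan-out step (again in Python for illustration; the project ID, topic name, and CSV column are placeholders, not from the question):
import csv
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# 'my-project' and 'email-jobs' are placeholder names.
topic_path = publisher.topic_path('my-project', 'email-jobs')

def fan_out(event, context):
    """Triggered by google.storage.object.finalize; publishes one message per CSV row."""
    local_path = '/tmp/' + event['name'].replace('/', '_')
    # (download the object into local_path as shown earlier)
    with open(local_path) as f:
        for row in csv.DictReader(f):
            # The second function receives the address as a message attribute;
            # .result() blocks so nothing is dropped when the function returns.
            publisher.publish(topic_path, b'', email=row['email']).result()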
Basically, you will have to download the file locally to the Cloud Function instance to be able to read it in this way.
Now there are multiple options to work around this.
The most basic/simplest is to provision a Compute Engine machine and run this operation from it, if it is a one-off event.
If you need to do this more frequently (i.e. daily), you can use an online tool to convert your CSV file into JSON and import it into Firestore; you can then read the emails from Firestore a lot faster.
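For the Firestore route, reading the addresses back could look roughly like this (a Python sketch; the 'emails' collection name and 'address' field are assumptions):
from google.cloud import firestore

db = firestore.Client()

def list_addresses():
    # Stream every document in the (assumed) 'emails' collection.
    for doc in db.collection('emails').stream():
        yield doc.to_dict().get('address')

for address in list_addresses():
    print(address)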

How to use option Arbitration=WaitExternal in MySQL Cluster?

I'm currently reading the MySQL Reference Manual and noticed an NDB config option, Arbitration=WaitExternal. The question is how to use this option and how to implement an external cluster manager.
The Arbitration parameter also makes it possible to configure arbitration in such a way that the cluster waits until after the time determined by ArbitrationTimeout has passed for an external cluster manager application to perform arbitration instead of handling arbitration internally. This can be done by setting Arbitration = WaitExternal in the [ndbd default] section of the config.ini file. For best results with the WaitExternal setting, it is recommended that ArbitrationTimeout be 2 times as long as the interval required by the external cluster manager to perform arbitration.
A bit of git annotate and some searching of the original design docs say the following:
When the cluster is about to send an arbitration message to the arbitrator, it will instead issue the following log message:
case ArbitCode::WinWaitExternal:{
  char buf[8*4*2+1];
  sd->mask.getText(buf);
  BaseString::snprintf(m_text, m_text_len,
    "Continuing after wait for external arbitration, "
    "nodes: %s", buf);
  break;
}
So e.g.
Continuing after wait for external arbitration, nodes: 1,2
The external clusterware should check for this message at the same interval as the ArbitrationTimeout. When it discovers this message, the external clusterware should kill the data node that it decides should lose the arbitration. This kill will be noted by the NDB data nodes and will settle the matter of which node is to survive.
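As a rough illustration only, an external arbitrator loop might look like the sketch below; the cluster log path, the polling interval, the decision policy, and the use of ndb_mgm to stop the losing node are all assumptions, not something the answer above specifies.
import re
import subprocess
import time

CLUSTER_LOG = '/var/lib/mysql-cluster/ndb_1_cluster.log'   # assumed log location
POLL_INTERVAL = 7.5                                        # roughly ArbitrationTimeout, in seconds
PATTERN = re.compile(r'Continuing after wait for external arbitration, nodes: (\S+)')

def choose_node_to_kill(reported_nodes):
    # Placeholder: the external cluster manager's own policy decides
    # which data node should lose the arbitration.
    return reported_nodes[-1]

offset = 0
while True:
    with open(CLUSTER_LOG) as log:
        log.seek(offset)
        for line in log:
            match = PATTERN.search(line)
            if match:
                nodes = [int(n) for n in match.group(1).split(',')]
                loser = choose_node_to_kill(nodes)
                # Killing/stopping the node settles which side survives.
                subprocess.run(['ndb_mgm', '-e', f'{loser} STOP'], check=False)
        offset = log.tell()
    time.sleep(POLL_INTERVAL)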

Google Cloud SQL No Response

We are running a Sails.js API on Google Container Engine with a Cloud SQL database and recently we've been finding some of our endpoints have been stalling, never sending a response.
I had a health check monitoring /v1/status and it registered 100% uptime when I had the following simple response:
status: function( req, res ){
  res.ok('Welcome to the API');
}
As soon as we added a database query, the endpoint started timing out. It doesn't happen all the time, but seemingly at random intervals, sometimes for hours on end. This is what we have changed the query to:
status: function( req, res ){
  Email.findOne({ value: "someone@example.com" }).then(function( email ){
    res.ok('Welcome to the API');
  }).fail(function(err){
    res.serverError(err);
  });
}
Rather suspiciously, this all works fine in our staging and development environments; it's only when the code is deployed in production that the timeout occurs, and it only occurs some of the time. The only thing that changes between staging and production is the database we are connecting to and the load on the server.
As I mentioned earlier, we are using Google Cloud SQL and the Sails-MySQL adapter. We have the following error stacks from the production server:
AdapterError: Invalid connection name specified
at getConnectionObject (/app/node_modules/sails-mysql/lib/adapter.js:1182:35)
at spawnConnection (/app/node_modules/sails-mysql/lib/adapter.js:1097:7)
at Object.module.exports.adapter.find (/app/node_modules/sails-mysql/lib/adapter.js:801:16)
at module.exports.find (/app/node_modules/sails/node_modules/waterline/lib/waterline/adapter/dql.js:120:13)
at module.exports.findOne (/app/node_modules/sails/node_modules/waterline/lib/waterline/adapter/dql.js:163:10)
at _runOperation (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/operations.js:408:29)
at run (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/operations.js:69:8)
at bound.module.exports.findOne (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/basic.js:78:16)
at bound [as findOne] (/app/node_modules/sails/node_modules/lodash/dist/lodash.js:729:21)
at Deferred.exec (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:501:16)
at tryCatcher (/app/node_modules/sails/node_modules/waterline/node_modules/bluebird/js/main/util.js:26:23)
at ret (eval at <anonymous> (/app/node_modules/sails/node_modules/waterline/node_modules/bluebird/js/main/promisify.js:163:12), <anonymous>:13:39)
at Deferred.toPromise (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:510:61)
at Deferred.then (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:521:15)
at Strategy._verify (/app/api/services/passport.js:31:7)
at Strategy.authenticate (/app/node_modules/passport-local/lib/strategy.js:90:12)
at attempt (/app/node_modules/passport/lib/middleware/authenticate.js:341:16)
at authenticate (/app/node_modules/passport/lib/middleware/authenticate.js:342:7)
at Object.AuthController.login (/app/api/controllers/AuthController.js:119:5)
at bound (/app/node_modules/sails/node_modules/lodash/dist/lodash.js:729:21)
at routeTargetFnWrapper (/app/node_modules/sails/lib/router/bind.js:179:5)
at callbacks (/app/node_modules/sails/node_modules/express/lib/router/index.js:164:37)
Error (E_UNKNOWN) :: Encountered an unexpected error :
Could not connect to MySQL: Error: Pool is closed.
at afterwards (/app/node_modules/sails-mysql/lib/connections/spawn.js:72:13)
at /app/node_modules/sails-mysql/lib/connections/spawn.js:40:7
at process._tickDomainCallback (node.js:381:11)
Looking at the errors alone, I'd be tempted to say that we have something misconfigured. But the fact that it works some of the time (and has previously been working fine!) leads me to believe that there's some other black magic at work here. Our Cloud SQL instance is D0 (though we've tried upping the size to D4) and our activation policy is "Always On".
EDIT: I had seen others complain about Google Cloud SQL (e.g. this SO post) and I was suspicious, but we have since moved our database to Amazon RDS and we are still seeing the same issues, so it must be a problem with Sails and the MySQL adapter.
This issue is leading to hours of downtime a day, we need it resolved, any help is much appreciated!
This appears to be a sails issue, and not necessarily related to Cloud SQL.
Is there any way the QPS limit for Google Cloud SQL is being reached? See here: https://cloud.google.com/sql/faq#sizeqps
Why is my database instance sometimes slow to respond?
In order to minimize the amount you are charged for instances on per use billing plans, by default your instance becomes passive if it is not accessed for 15 minutes. The next time it is accessed there will be a short delay while it is activated. You can change this behavior by configuring the activation policy of the instance. For an example, see Editing an Instance Using the Cloud SDK.
It might be related to your activation policy setting. If it is set to ON_DEMAND, the instance sleeps to save your budget, so the first query (which activates the instance) is slow. This might cause the timeout.
https://cloud.google.com/sql/faq?hl=en

SOLR - Best approach to import 20 million documents from csv file

My current task at hand is to figure out the best approach to load millions of documents into Solr.
The data file is an export from the DB in CSV format.
Currently, I am thinking about splitting the file into smaller files and having a script post these smaller ones using curl.
I have noticed that if you post a high amount of data, the request times out most of the time.
I am looking into the Data Import Handler and it seems like a good option.
Any other ideas are highly appreciated.
Thanks
Unless a database is already part of your solution, I wouldn't add that additional complexity. Quoting the SOLR FAQ, it's your servlet container that is issuing the session time-out.
As I see it, you have a couple of options (In my order of preference):
Increase container timeout
Increase the container timeout. ("maxIdleTime" parameter, if you're using the embedded Jetty instance).
I'm assuming you only occasionally index such large files? Increasing the time-out temporarily might just be the simplest option.
Split the file
Here's a simple Unix script that will do the job (splitting the file into 500,000-line chunks):
split -d -l 500000 data.csv split_files.
for file in `ls split_files.*`
do
  curl 'http://localhost:8983/solr/update/csv?fieldnames=id,name,category&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @$file
done
Parse the file and load in chunks
The following groovy script uses opencsv and solrj to parse the CSV file and commit changes to Solr every 500,000 lines.
import au.com.bytecode.opencsv.CSVReader

import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument

@Grapes([
    @Grab(group='net.sf.opencsv', module='opencsv', version='2.3'),
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
    @Grab(group='ch.qos.logback', module='logback-classic', version='1.0.0'),
])
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");

new File("data.csv").withReader { reader ->
    CSVReader csv = new CSVReader(reader)
    String[] result
    Integer count = 1
    Integer chunkSize = 500000

    while ((result = csv.readNext()) != null) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", result[0])
        doc.addField("name_s", result[1])
        doc.addField("category_s", result[2])
        server.add(doc)
        // Commit every chunkSize documents
        if (count.mod(chunkSize) == 0) {
            server.commit()
        }
        count++
    }
    // Commit any remaining documents
    server.commit()
}
In SOLR 4.0 (currently in BETA), CSVs from a local directory can be imported directly using the UpdateHandler. Modifying the example from the SOLR Wiki:
curl "http://localhost:8983/solr/update?stream.file=exampledocs/books.csv&stream.contentType=text/csv;charset=utf-8"
This streams the file from the local location, so there's no need to chunk it up and POST it via HTTP.
The answers above have explained the ingestion strategies from a single machine really well.
Here are a few more options if you have big-data infrastructure in place and want to implement a distributed data ingestion pipeline.
Use Sqoop to bring the data into Hadoop, or place your CSV file in Hadoop manually.
Use one of the connectors below to ingest the data:
the hive-solr connector or the spark-solr connector.
PS:
Make sure no firewall blocks connectivity between the client nodes and the Solr/SolrCloud nodes.
Choose the right directory factory for data ingestion; if near-real-time search is not required, use StandardDirectoryFactory.
If you get the exception below in the client logs during ingestion, tune the autoCommit and autoSoftCommit configuration in the solrconfig.xml file.
SolrServerException: No live SolrServers available to handle this request
Definitely just load these into a normal database first. There are all sorts of tools for dealing with CSVs (for example, Postgres' COPY), so it should be easy. Using the Data Import Handler is also pretty simple, so this seems like the most friction-free way to load your data. This method will also be faster since you won't have unnecessary network/HTTP overhead.
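For instance, the COPY step mentioned above could be scripted roughly like this (a Python sketch using psycopg2; the table name, column names, and connection details are assumptions):
import psycopg2

conn = psycopg2.connect("dbname=solr_staging user=loader")   # assumed connection string
cur = conn.cursor()

# Bulk-load the CSV export straight into a staging table.
with open("data.csv") as f:
    cur.copy_expert(
        "COPY documents (id, name, category) FROM STDIN WITH (FORMAT csv)",
        f,
    )

conn.commit()
cur.close()
conn.close()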
The reference guide says ConcurrentUpdateSolrServer could/should be used for bulk updates.
Javadocs are somewhat incorrect (v 3.6.2, v 4.7.0):
ConcurrentUpdateSolrServer buffers all added documents and writes them into open HTTP connections.
It doesn't buffer indefinitely, but up to int queueSize, which is a constructor parameter.