Is there any way to track file information (size, name, bucket location, create/update timestamp) in multiple Google buckets? - google-cloud-functions

I'm able to track file ingress in a single bucket using the code below, but I want to track all files going into the different buckets of my project on Google Cloud. Is there any way to do this?
from google.cloud import bigquery

def hello_gcs(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    BQ = bigquery.Client()
    table_id = 'xx.DW_STAGE.Bucket_Monitor'
    table = BQ.get_table(table_id)

    # Pull the object metadata out of the Cloud Storage event payload.
    bucket = event['bucket']
    file_nm = event['name']
    create_ts = event['timeCreated']
    update_ts = event['updated']
    size = event['size']
    contentType = event['contentType']
    crc32c = event['crc32c']
    etag = event['etag']
    generation = event['generation']
    file_id = event['id']
    kind = event['kind']
    md5Hash = event['md5Hash']
    medialink = event['mediaLink']
    metageneration = event['metageneration']
    selfLink = event['selfLink']
    storageClass = event['storageClass']
    timeStorageClassUpdated = event['timeStorageClassUpdated']

    errors = BQ.insert_rows(table, [(bucket, file_nm, create_ts, update_ts, size,
                                     contentType, crc32c, etag, generation, file_id,
                                     kind, md5Hash, medialink, metageneration,
                                     selfLink, storageClass,
                                     timeStorageClassUpdated)])  # Make an API request.
    if errors == []:
        print("New rows have been added.")
    else:
        print("Encountered errors while inserting rows: {}".format(errors))

You may be interested in Pub/Sub. This mechanism is used under the hood by Cloud Functions Storage triggers (the description is here).
For each of your buckets you can create a Pub/Sub notification pointing to the same topic (how to create a notification is described here). Each notification has attributes and a payload that contain information about the object that changed.
Such a notification can then be used as the trigger for a Cloud Function similar to yours; a sketch follows below.
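As a rough sketch (not a drop-in implementation): notifications can be created per bucket with, for example, gsutil notification create -t <topic> -f json gs://<bucket>, and a single Pub/Sub-triggered function can then handle all buckets. With the json payload format the object metadata arrives as JSON in the base64-encoded message data; the function name, table id and column names below are placeholder assumptions:

import base64
import json
from google.cloud import bigquery

BQ = bigquery.Client()
TABLE_ID = 'xx.DW_STAGE.Bucket_Monitor'  # placeholder, reuse your existing table

def gcs_notification(event, context):
    """Triggered by a Pub/Sub message published by a Cloud Storage notification."""
    # The notification payload (format 'json') is the changed object's metadata.
    obj = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    table = BQ.get_table(TABLE_ID)
    row = {
        'bucket': obj['bucket'],          # column names are assumptions; only a few
        'file_nm': obj['name'],           # of the available fields are shown here
        'create_ts': obj['timeCreated'],
        'update_ts': obj['updated'],
        'size': obj['size'],
    }
    errors = BQ.insert_rows(table, [row])
    if errors:
        print("Encountered errors while inserting rows: {}".format(errors))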

Related

Execution ID on Google Cloud Run

I am wondering whether there is an execution ID in Cloud Run like the one in Google Cloud Functions.
An ID that identifies each invocation separately would be very useful with "Show matching entries" in Cloud Logging to get all logs related to a single execution.
I understand the execution model is different, since Cloud Run allows concurrency, but is there a workaround to assign each log line to a certain execution?
My final need is to group the request and the response on the same line. For now I am printing them separately, and if a few requests arrive at the same time I can't see which response corresponds to which request.
Thank you for your attention!
OpenTelemetry looks like a great solution, but the learning and setup time isn't negligible,
so I'm going with a custom ID created in before_request, stored in Flask's g, and included in every print().
import uuid

from flask import g

@app.before_request
def before_request_func():
    # Generate a unique ID per request and keep it on Flask's request-scoped g object.
    execution_id = uuid.uuid4()
    g.execution_id = execution_id
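A minimal usage sketch building on the snippet above (the route, request parsing and message format are placeholder assumptions): prefixing every print() with the ID makes request and response lines easy to match in Cloud Logging.

from flask import request

@app.route('/process', methods=['POST'])
def process():
    print(f"[{g.execution_id}] request: {request.get_json(silent=True)}")
    response = {"status": "ok"}
    print(f"[{g.execution_id}] response: {response}")
    return response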

How to read CSV data stored in Google Cloud Storage with Cloud Functions

As part of a communications effort to a large user base, I need to send upwards of 75,000 emails per day. The emails of the users I'm contacting are stored in a CSV file. I've been using Postman Runner to send these requests via SendGrid (Email API), but with such a large volume, my computer either slows way down or Postman completely crashes before the batch completes. Even if it doesn't crash, it takes upwards of 3 hours to send this many POST requests via Runner.
I'd like to upload the CSV containing the emails into a Cloud Storage bucket and then access the file using Cloud Functions to send a POST request for each email. This way, all the processing can be handled by GCP and not by my personal machine. However, I can't seem to get the Cloud Function to read the CSV data line-by-line. I've tried using createReadStream() from the Cloud Storage NodeJS client library along with csv-parser, but can't get this solution to work. Below is what I tried:
const sendGridMail = require('@sendgrid/mail');
const { Storage } = require('@google-cloud/storage');
const fs = require('fs');
const csv = require('csv-parser');

exports.sendMailFromCSV = (file, context) => {
    console.log(` Event: ${context.eventId}`);
    console.log(` Event Type: ${context.eventType}`);
    console.log(` Bucket: ${file.bucket}`);
    console.log(` File: ${file.name}`);
    console.log(` Metageneration: ${file.metageneration}`);
    console.log(` Created: ${file.timeCreated}`);
    console.log(` Updated: ${file.updated}`);

    const storage = new Storage();
    const bucket = storage.bucket(file.bucket);
    const remoteFile = bucket.file(file.name);
    console.log(remoteFile);

    let emails = [];
    fs.createReadStream(remoteFile)
        .pipe(csv())
        .on('data', function (row) {
            console.log(`Email read: ${row.email}`);
            emails.push(row.email);
            // send email using the SendGrid helper library
            const msg = {
                to: [{
                    "email": row.email
                }],
                from: "fakeemail@gmail.com",
                template_id: "fakeTemplate",
            };
            sendGridMail.send(msg).then(() =>
                context.status(200).send(file.body))
                .catch(function (err) {
                    console.log(err);
                    context.status(400).send(file.body);
                });
        })
        .on('end', function () {
            console.table(emails);
        });
};
The Cloud Function is currently triggered by an upload to the Cloud Storage bucket.
Is there a way to build a solution to this problem without loading the file into memory? Is Cloud Functions the right path to be moving down, or would it be better to use App Engine or some other tool? I'm willing to try any GCP solution that moves this process to the cloud.
A Cloud Function's memory can be used as a temporary directory, /tmp. You can therefore download the CSV file from the Cloud Storage bucket into that directory as a local file and then process it as if it were on a local drive; a minimal sketch in Python follows.
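This is only a sketch under assumptions (the 'email' column name and sending via SendGrid come from the question; the function name is a placeholder), not a definitive implementation:

# Download the uploaded CSV to /tmp, then read it row by row.
import csv
import os
from google.cloud import storage

def send_mail_from_csv(event, context):
    client = storage.Client()
    local_path = os.path.join('/tmp', os.path.basename(event['name']))
    client.bucket(event['bucket']).blob(event['name']).download_to_filename(local_path)
    with open(local_path, newline='') as f:
        for row in csv.DictReader(f):
            print('Email read: {}'.format(row['email']))  # send via SendGrid here
    os.remove(local_path)  # /tmp counts against the function's memory, so clean up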
At the same time, you should keep in mind two main restrictions:
Memory - up to 2 GB for everything (and /tmp counts against it)
Timeout - no more than 540 seconds per invocation.
I personally would create a solution based on a combination of a few GCP resources.
The first Cloud Function is triggered by a 'finalize' event - when the CSV file is saved in the bucket. This Cloud Function reads the file and, for every record, composes a Pub/Sub message with relevant details (enough to send an email). That message is posted to a Pub/Sub topic.
The Pub/Sub topic is used to transfer all messages from the first Cloud Function and to trigger the second Cloud Function.
The second Cloud Function is triggered by a Pub/Sub message which contains all necessary details to process and send an email. As there may be 75K records in the source CSV file (for example), you should expect 75K invocations of the second Cloud Function.
That may be enough at a high level. The Pub/Sub paradigm guarantees at-least-once delivery (so possibly more than once), so if you need no more than one email per address, some additional resources may be required to achieve idempotent behaviour. A sketch of the fan-out step is below.
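A minimal sketch of the fan-out step inside the first function, assuming a recent google-cloud-pubsub client; the project ID, topic name and the 'email' column are placeholders:

import csv
import io
import json
from google.cloud import pubsub_v1, storage

publisher = pubsub_v1.PublisherClient()
TOPIC_PATH = publisher.topic_path('your-project-id', 'email-requests')  # placeholders

def fan_out(event, context):
    # Read the whole CSV (75K email addresses is still small) straight from the bucket.
    blob = storage.Client().bucket(event['bucket']).blob(event['name'])
    content = blob.download_as_string().decode('utf-8')
    for row in csv.DictReader(io.StringIO(content)):
        # One message per record, with just enough detail for the second function.
        publisher.publish(TOPIC_PATH, json.dumps({'email': row['email']}).encode('utf-8'))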
Basically, you will have to download the file locally onto the Cloud Function instance to be able to read it this way.
There are multiple ways to work around this.
The most basic/simplest is to provision a Compute Engine machine and run this operation from it, if it is a one-off event.
If you need to do this more frequently (i.e. daily), you can use an online tool to convert your CSV file into JSON and import it into Firestore; you can then read the emails from Firestore a lot faster (a read sketch follows).
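A minimal read sketch, assuming the import produced an 'emails' collection with an 'email' field (both names are assumptions):

from google.cloud import firestore

db = firestore.Client()
for doc in db.collection('emails').stream():
    print(doc.to_dict().get('email'))  # hand each address to the mail-sending step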

Get Json Files via API url into Google Cloud

I'm quite new to Google Cloud Platform and have the following task: I get a dynamic .json file via an API URL. Now I want to store those .json files at a given interval in one of the GCP databases.
So my question is: which DB service should I use, and how do I get the .json files from the URL saved into the database? I had a look at whether this could work with Cloud Functions, but haven't really found any solution.
Thanks in advance
Alex
For instance, if you use Django/Python with Google App Engine:
Create a Google Cloud account with Google Storage/Google App Engine activated. (These are extensive steps; if you need further help, please elaborate.)
Create an API with urls.py.
Associate the API call with the following function:
import os

from django.conf import settings
from django.contrib.auth.decorators import login_required
from django.utils.decorators import method_decorator
from google.cloud import storage
from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView

class JSONFileView(APIView):
    @method_decorator(login_required(login_url='/login/'))
    def get(self, request, filename):
        root_path = request.user.username + "-" + str(request.user.id) + '/'
        file_path = os.path.join(root_path, filename, 'yourfile.data', 'yourfile.json')
        storage_client = storage.Client(project=settings.PROJECT_ID, credentials=settings.GS_CREDENTIALS)
        bucket = storage_client.get_bucket(settings.GS_BUCKET_NAME)
        blob = bucket.get_blob(file_path)
        if blob is not None:
            json_data = str(blob.download_as_string(raw_download=True).decode('utf8'))
        else:
            json_data = {}
            return Response(status=status.HTTP_404_NOT_FOUND)
        return Response(json_data)
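For the other half of the question (fetching the .json from the API URL on an interval and storing it), a minimal sketch could be an HTTP-triggered Cloud Function invoked by Cloud Scheduler; the URL, bucket name, object path and function name below are all placeholder assumptions, not part of the answer above:

import datetime

import requests
from google.cloud import storage

API_URL = 'https://example.com/api/data.json'  # placeholder
BUCKET_NAME = 'your-bucket-name'               # placeholder

def fetch_and_store(request):
    # Fetch the dynamic JSON document and store a timestamped copy in the bucket.
    payload = requests.get(API_URL).text
    blob_name = 'snapshots/{}.json'.format(datetime.datetime.utcnow().isoformat())
    blob = storage.Client().bucket(BUCKET_NAME).blob(blob_name)
    blob.upload_from_string(payload, content_type='application/json')
    return 'stored {}'.format(blob_name)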

How do I create a lot of sample data for firestore?

Let's say I need to create a lot of different documents/collections in Firestore. I need to add them quickly, like copy-pasting JSON. I can't do that with the standard Firebase console, because adding 100 documents will take me forever. Are there any solutions to bulk-create mock data with a given structure in a Firestore DB?
If you switch to the Cloud Console (rather than Firebase Console) for your project, you can use Cloud Shell as a starting point.
In the Cloud Shell environment you'll find tools like node and python installed and available. Using whichever one you prefer, you can write a script using the server client libraries.
For example in Python:
from google.cloud import firestore
import random

MAX_DOCUMENTS = 100
SAMPLE_COLLECTION_ID = u'users'
SAMPLE_COLORS = [u'Blue', u'Red', u'Green', u'Yellow', u'White', u'Black']

# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()
collection_ref = db.collection(SAMPLE_COLLECTION_ID)

for x in range(MAX_DOCUMENTS):
    collection_ref.add({
        u'primary': random.choice(SAMPLE_COLORS),
        u'secondary': random.choice(SAMPLE_COLORS),
        u'trim': random.choice(SAMPLE_COLORS),
        u'accent': random.choice(SAMPLE_COLORS)
    })
While this is the easiest way to get up and running with a static dataset, it leaves a little to be desired. Namely, with Firestore, live dynamic data is needed to exercise its functionality, such as real-time queries. For this task, using Cloud Scheduler & Cloud Functions is a relatively easy way to regularly update sample data.
In addition to the sample generation code, you specify the update frequency in Cloud Scheduler. For instance, */10 * * * * defines a frequency of every 10 minutes using the standard unix-cron format.
For non-static data, often a timestamp is useful. Firestore provides a way to have a timestamp from the database server added at write-time as one of the fields:
u'timestamp': firestore.SERVER_TIMESTAMP
It is worth noting that timestamps like this will hotspot in production systems if not sharded correctly. Typically, 500 writes/second to the same collection is the maximum you will want, so that the index doesn't hotspot. Sharding can be as simple as each user having their own collection (500 writes/second per user). For this example, however, writing 100 documents every minute via a scheduled Cloud Function is definitely not an issue; a sketch of such a function follows.
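A minimal sketch of an HTTP-triggered function that Cloud Scheduler could call on that schedule, reusing the placeholder collection and colour fields above (the function name and batch size are assumptions):

import random

from google.cloud import firestore

SAMPLE_COLORS = [u'Blue', u'Red', u'Green', u'Yellow', u'White', u'Black']
db = firestore.Client()

def refresh_sample_data(request):
    collection_ref = db.collection(u'users')
    for _ in range(100):
        collection_ref.add({
            u'primary': random.choice(SAMPLE_COLORS),
            u'secondary': random.choice(SAMPLE_COLORS),
            # Server-side timestamp, set by the database at write time.
            u'timestamp': firestore.SERVER_TIMESTAMP,
        })
    return 'added 100 documents'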
FireKit is a good resource to use for this purpose. It even allows sub-collections.
https://retroportalstudio.gumroad.com/l/firekit_free

Reading messages from Pub/Sub in batches using Cloud Function

Working through this guide:
https://cloud.google.com/functions/docs/tutorials/pubsub
I ran into an issue where I need to read the messages from Pub/Sub in batches of 1,000 per batch, because I'll be posting the messages in batches to a remote API from my Cloud Function.
In short, 1,000 messages need to be read per invocation from Pub/Sub.
I've previously done something similar with Kinesis and Lambda using the batch-size parameter, but have not found a similar configuration for Cloud Functions.
aws lambda create-event-source-mapping --region us-west-2 --function-name kinesis-to-bigquery --event-source <arn of the kinesis stream> --batch-size 1000 --starting-position TRIM_HORIZON
Function:
// Pub/Sub function
export function helloPubSub (event, callback) {
    const pubsubMessage = event.data;
    const name = pubsubMessage.data ? Buffer.from(pubsubMessage.data, 'base64').toString() : 'World';
    console.log(`Hello, ${name}!`);
    callback();
}
My question is whether this is possible using Cloud Functions, or whether other approaches to this problem exist.
Cloud Functions doesn't work with Pub/Sub like that - you don't read messages out of a queue. Instead, the events are delivered individually to your function as soon as possible. If you want to wait until you have 1,000 messages, you'll have to queue them up yourself using some other persistence mechanism, then act on them when you have enough available.
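As a rough alternative sketch (explicitly not the Pub/Sub trigger model, and assuming a recent google-cloud-pubsub client): a function invoked on a schedule, e.g. by Cloud Scheduler, could synchronously pull up to 1,000 messages from a subscription, post the batch to the remote API, and then acknowledge them. The project ID, subscription name and remote-API call are placeholders:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
SUBSCRIPTION = subscriber.subscription_path('your-project-id', 'your-subscription')  # placeholders

def pull_batch(request):
    # Synchronous pull returns up to max_messages that are currently available.
    response = subscriber.pull(request={"subscription": SUBSCRIPTION, "max_messages": 1000})
    messages = [m.message.data.decode('utf-8') for m in response.received_messages]
    if messages:
        # post_to_remote_api(messages)  # placeholder for the batched call to the remote API
        ack_ids = [m.ack_id for m in response.received_messages]
        subscriber.acknowledge(request={"subscription": SUBSCRIPTION, "ack_ids": ack_ids})
    return 'processed {} messages'.format(len(messages))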