Trying to import a JSON config and environment variables into Glue

I am trying to run a Glue job that would potentially read from various buckets based on the Lambda trigger. What I am trying to do is have those environment variables (S3 URIs) stored in a JSON file.
My idea is to write an if condition that would run different ETL scripts based on the key values from this JSON file.
The issue I am facing is that there isn't any clear documentation that can help me read this JSON file and extract the data from it.
I do have this code that reads the file from an S3 bucket, but based on my understanding it won't work in the Glue script because it's not using the GlueContext.
import json

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

def read_json_from_s3(s3_file_path, **kwargs):
    bckt = s3_file_path.split("/", 3)[2]   # bucket name from "s3://bucket/key"
    key = s3_file_path.split("/", 3)[-1]   # object key
    try:
        s3_client.head_object(Bucket=bckt, Key=key)
    except ClientError as e:
        raise FileNotFoundError(f"File with provided S3 path not found: {s3_file_path}") from e
    data = s3_client.get_object(Bucket=bckt, Key=key, **kwargs)["Body"].read().decode()
    return json.loads(data)
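For what it's worth, plain boto3 calls like the ones above do run inside a Glue Python script even without touching the GlueContext, so the helper can be reused as-is. Below is a minimal sketch of wiring it into the Glue job, assuming the Lambda trigger passes the config location as a hypothetical job argument named --config_s3_uri and that the JSON uses made-up keys like etl_target:
import sys

from awsglue.utils import getResolvedOptions

# --config_s3_uri is a hypothetical job argument, set by the Lambda that starts the job
args = getResolvedOptions(sys.argv, ["config_s3_uri"])

# reuse the read_json_from_s3 helper shown above
config = read_json_from_s3(args["config_s3_uri"])

# branch on a key from the config file (key and URI names are assumptions)
if config.get("etl_target") == "sales":
    source_uri = config["sales_bucket_uri"]
else:
    source_uri = config["default_bucket_uri"]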

Related

How to load JSON data (call from API) without key directly to S3 bucket using Python?

I am relatively new to AWS S3. I am calling an API to load JSON data directly into an S3 bucket; from there the data will be read by Snowflake. After researching, I found that using Boto3 we can load data into S3 directly. The code will look something like below. However, one thing I am not sure about is what I should put for the key, as there is no object created in my S3 bucket yet. Also, what is good practice for loading JSON data into S3? Do I need to encode the JSON data to 'UTF-8', as done here by SO user Uwe Bretschneider?
Thanks in advance!
Python code:
import json
import urllib.request

import boto3

data = urllib.request.urlopen("https://api.github.com/users?since=100").read()
output = json.loads(data)
print(output)  # checking the data

s3 = boto3.client('s3')
s3.put_object(
    Body=json.dumps(output),
    Bucket='I_HAVE_BUCKET_NAME',
    Key='your_key_here',
)
put_object creates a new object in the bucket, so there is no existing key to refer to.
The key is just like a file name in a file system. You can specify whatever name you like, such as my-data.json or some-dir/my-data.json. You can find out more at https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html.
As for encoding, it's always good to specify it explicitly IMO, just to make sure your source data is properly encoded too.
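Putting those two points together, here is a small sketch of the upload with an explicit key, UTF-8 encoding, and a content type (the key name and payload below are just examples):
import json

import boto3

s3 = boto3.client('s3')
payload = {"example": "value"}  # stand-in for the parsed API response
s3.put_object(
    Body=json.dumps(payload).encode('utf-8'),  # serialize, then encode explicitly
    Bucket='I_HAVE_BUCKET_NAME',               # placeholder bucket name from the question
    Key='github/users-since-100.json',         # any path-like name works as the key
    ContentType='application/json',
)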

Why does writeStream write the metadata but not the JSON data into the S3 bucket from my Databricks workbook?

I have been working on a Databricks workbook to read some data, reformat it, and write it into an S3 bucket.
csv_data_frame = (spark.readStream.format("csv")
    .option("cloudFiles.format", "csv")
    .option("path", "/mnt/data/*.tsv")
    .option("header", "true")
    .option("delimiter", "\t")
    .schema(schema)
    .load())
(csv_data_frame.writeStream.format("json")
    .option("path", "/mnt/data/")
    .option("checkpointLocation", "/mnt/data/")
    .start())
It does write a metadata file, but no actual json file. Is my writeStream statement formatted correctly?
I got it to work by changing the writeStream command to:
csv_data_frame.writeStream.format("json").option("path", "/mnt/data/")
I made it too complicated.
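For reference, the usual pattern is to keep the checkpoint location separate from the output path and to call .start() on the writer; a sketch with hypothetical mount paths (not necessarily the poster's final setup):
query = (csv_data_frame.writeStream.format("json")
    .option("path", "/mnt/data/output/")                      # hypothetical output directory
    .option("checkpointLocation", "/mnt/data/_checkpoint/")   # kept separate from the data
    .trigger(once=True)                                       # process available input, then stop
    .start())
query.awaitTermination()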

How do I Pass JSON as a parameter to AWS Lambda

I have a CloudFormation template that consists of a Lambda function that reads messages from the SQS Queue.
The Lambda function will read the message from the queue and transform it using a JSON template (which I want to be injected externally).
I will deploy different stacks for different products and for each product I will provide different JSON templates to be used for transformation.
I have a few options but couldn't decide which one is better:
I can keep all JSON files inside the project, pack them together, and pass the relevant JSON name as a parameter to the Lambda.
I can store the JSON files on S3 and pass the S3 URL to the Lambda so I can read it at runtime.
I can store the JSON files in DynamoDB and read them from there, using the same approach as option 2.
The first one seems like a better approach as I don't need to read from an external file on every Lambda execution, but I will need to pack all templates together.
The last two are cleaner approaches but require an external call to read the JSON on every invocation.
Another approach could be (I'm not sure if it is possible) to inject a JSON file into the Lambda at deploy time from an S3 bucket or similar, and have the Lambda function read it like an environment variable.
As you can see from the CloudFormation documentation, Lambda environment variables can only be a map of strings, so the actual value you pass to the function as an environment variable must be a string. You could pass your JSON as a string, but the problem is that the maximum size for all environment variables combined is 4 KB.
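If the template does fit under that limit, reading it back inside the function is a one-liner; a sketch assuming a hypothetical environment variable named TEMPLATE_JSON:
import json
import os

# TEMPLATE_JSON is a hypothetical environment variable holding the template as a string
template = json.loads(os.environ["TEMPLATE_JSON"])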
If your templates are bigger and you don't want to call S3 or DynamoDB at runtime, a workaround is to write a simple shell script that copies the correct template file into the Lambda folder before building and deploying the stack. This way the Lambda gets deployed as a package containing the code and only the desired JSON template.
I decided to go with the S3 setup and also improved efficiency by storing the JSON in a global variable after reading it the first time, so I read it once and reuse it for the lifetime of the Lambda container.
I'm not sure this is the best solution, but it works well enough for my scenario.
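A sketch of that caching pattern, assuming hypothetical TEMPLATE_BUCKET and TEMPLATE_KEY environment variables point at the JSON file in S3:
import json
import os

import boto3

s3 = boto3.client("s3")
_template_cache = None  # survives across invocations while the Lambda container stays warm

def get_template():
    global _template_cache
    if _template_cache is None:
        obj = s3.get_object(Bucket=os.environ["TEMPLATE_BUCKET"],
                            Key=os.environ["TEMPLATE_KEY"])
        _template_cache = json.loads(obj["Body"].read().decode("utf-8"))
    return _template_cache

def handler(event, context):
    template = get_template()  # S3 is only called on the first (cold) invocation
    # ... transform the SQS records in event["Records"] using template ...
    return {"processed": len(event.get("Records", []))}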

Google Cloud Functions: loading GCS JSON files into BigQuery with non-standard keys

I have a Google Cloud Storage bucket where a legacy system drops NEW_LINE_DELIMITED_JSON files that need to be loaded into BigQuery.
I wrote a Google Cloud Function that takes the JSON file and loads it up to BigQuery. The function works fine with sample JSON files - the problem is the legacy system is generating a JSON with a non-standard key:
{
"id": 12345,
"#address": "XXXXXX"
...
}
Of course the "#address" key throws everything off and the cloud function errors out ...
Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could deactivate the autodetect and provide my own mapping, but the online documentation does not cover this situation.
I am contemplating the option of:
Loading the file in memory into a string var
Replace #address with address
Convert the json new line delimited to a list of dictionaries
Use bigquery stream insert to insert the rows in BQ
But I'm afraid this will take a lot longer, the file size may exceed the 2 GB maximum for functions, I'll have to deal with Unicode when loading the file into a variable, etc.
What other options do I have?
And no, I cannot modify the legacy system to rename the "#address" field :(
Thanks!
I'm going to assume the error that you are getting is something like this:
Errors: query: Invalid field name "#address". Fields must contain
only letters, numbers, and underscores, start with a letter or
underscore, and be at most 128 characters long.
This is an error message on the BigQuery side, because cols/fields in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.
Here's one way of doing it, which is completely serverless:
Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. It simply passes the name of the file to process to the pipeline.
In said Cloud Dataflow pipeline, read the JSON file into a ParDo, and using a JSON parsing library (e.g. Jackson if you are using Java), read the object and get rid of the "#" before creating your output TableRow object.
Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
To sum up, you'll need the following in the conga line:
File > GCS > Cloud Function > Dataflow (template) > BigQuery
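For illustration, the cleaning step could look something like this with the Beam Python SDK (the answer mentions Java/Jackson; the bucket path, table spec, and schema below are made up):
import json

import apache_beam as beam

def clean_keys(line):
    # parse one newline-delimited JSON record and strip the leading '#' from keys
    record = json.loads(line)
    return {k.lstrip("#"): v for k, v in record.items()}

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/incoming/file.json")
     | "CleanKeys" >> beam.Map(clean_keys)
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",
           schema="id:INTEGER,address:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))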
The advantages of this:
Event driven
Scalable
Serverless/no-ops
You get monitoring and alerting out of the box with Stackdriver
Minimal code
See:
Reading nested JSON in Google Dataflow / Apache Beam
https://cloud.google.com/dataflow/docs/templates/overview
https://shinesolutions.com/2017/03/23/triggering-dataflow-pipelines-with-cloud-functions/
disclosure: the last link is to a blog which was written by one of the engineers I work with.

App Engine: load data from a static JSON file or load data into the Datastore?

I'm new to App Engine and writing a REST API. Wondering if anyone has been in this dilemma before?
The data I have is not a lot (3 to 4 pages), but it changes annually.
Option 1: Write the data as JSON and parse the JSON file every time a request comes in.
Option 2: Model the data as objects, put them into the Datastore, and retrieve them whenever a request comes in.
Does anyone know the pros and cons of each method, or any better solutions?
Of course the answer is it depends.
Here are some of the questions I'd ask myself to make a decision -
do you want to make the change to the data dependent on a code push?
is there sensitive information in the data that should not be checked in to a VCS?
what other parts of your system depend on this data?
how likely are your assumptions about the data to change, in terms of update frequency and size?
Assuming the data is small (<1MB) and there's no sensitive information in it, I'd start out loading the JSON file as it's the simplest solution.
You don't have to parse the data on each request, but you can parse it at the top level once and effectively treat it as a constant.
Something along these lines -
import os
import json

DATA_FILE = os.path.join(os.path.dirname(__file__), 'YOUR_DATA_FILE.json')

with open(DATA_FILE, 'r') as dataFile:
    JSON_DATA = json.loads(dataFile.read())
You can then use JSON_DATA like a dictionary in your code.
awesome_data = JSON_DATA['data']['awesome']
In case you need to access the data in multiple places, you can move this into its own module (ex. config.py) and import JSON_DATA wherever you need it.
Ex. in main.py
from config import JSON_DATA
# do something w/ JSON_DATA