JSON to Parquet Kinesis Firehose timestamp column

I am using a Kinesis Firehose delivery stream to write JSON data coming from IoT Core into S3.
I have enabled data format conversion to transform the JSON payload to Parquet. For that I have an AWS Glue table schema, e.g.
temperature: float
humidity: float
project: string
timestamp: timestamp
My JSON payload looks like this:
{
"temperature": 20
"humidity": 60
"project": project
"timestamp": 2023-02-17 16:15:16.486570
}
The process works fine and I get my Parquet file in S3, but oddly it doesn't have the timestamp column. I've tried multiple timestamp formats; each one either causes an error in the process or produces a file without a timestamp column.
What am I missing? What format should the timestamp be?

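I finally managed to make it work by sending the timestamp as a quoted ISO 8601 UTC string with millisecond precision:
import json
from datetime import datetime
# e.g. 2023-02-17T16:15:16.486Z (trim microseconds to milliseconds, append 'Z')
curr_dt = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z'
# temperature and humidity are the sensor readings obtained elsewhere
payload = json.dumps({
    "timestamp": curr_dt,
    "temperature": round(temperature, 2),
    "humidity": round(humidity, 3),
    "project_id": "1",
})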
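For anyone reusing the format from the answer above, here is a minimal sketch of the same formatting wrapped in a helper. The function name firehose_timestamp is made up for illustration, and it uses a timezone-aware datetime instead of utcnow(); the output format itself is the one the answer reports as working.

```python
from datetime import datetime, timezone


def firehose_timestamp(dt: datetime) -> str:
    """Format an aware datetime as UTC ISO 8601 with milliseconds and a 'Z' suffix.

    Hypothetical helper name; the format mirrors the answer above.
    """
    utc = dt.astimezone(timezone.utc)
    # %f yields six digits of microseconds; dropping the last three leaves milliseconds
    return utc.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z'


# Fixed input so the result is reproducible
example = firehose_timestamp(
    datetime(2023, 2, 17, 16, 15, 16, 486570, tzinfo=timezone.utc)
)
print(example)
```

In a device script you would call firehose_timestamp(datetime.now(timezone.utc)) and put the result in the payload as a quoted string.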

Related

writing both string and json string to a csv file

I use requests to pull JSON files for companies. How do I write a ticker column and the JSON string to a CSV file (comma-separated) so I can import the CSV file into PostgreSQL?
My Python code:
ticker_list = ['AAPL','MSFT','IBM', 'APD']
for ticker in ticker_list:
    url_profile = fmp_url + profile + ticker + '?apikey=' + apikey
    # get data in json array format
    json_array = requests.get(url_profile).json()
    # for each record within the json array, use json.dumps to turn it into a json string.
    json_str = [json.dumps(element) for element in json_array]
    # add a ticker column and write both ticker and json string to a csv file:
    with open ("C:\\DATA\\fmp_profile_no_brackets.csv","a") as dest:
        for element in json_str:
            dest.writelines (ticker_str + ',' + element + '\n' )
In postgres I have table t_profile_json with 2 columns:
ticker varchar(20) and profile jsonb
When I copy the file fmp_profile into postgres using:
copy fmp.t_profile_json(ticker,profile) from 'C:\DATA\fmp_profile.csv' delimiter ',';
I get this error:
ERROR: extra data after last expected column
CONTEXT: COPY t_profile_json, line 1: "AAPL,{"symbol": "AAPL", "price": 144.49, "beta": 1.219468, "volAvg": 88657734, "mktCap": 22985613828..."
SQL state: 22P04
The copy command seems to add both "AAPL, json string.." as one string.
I did something wrong at the "dest.writelines (ticker_str + ',' + element + '\n' )", but I don't know how to correct it.
Thank you so much in advance for helping!
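A likely cause, offered as a sketch rather than a confirmed fix: the JSON string itself contains commas, so a plain-text COPY with delimiter ',' sees more than two columns on each line. Writing the rows with Python's csv module quotes the JSON field so it stays one column. The helper name write_ticker_rows and the sample record below are made up for illustration.

```python
import csv
import io
import json


def write_ticker_rows(dest, ticker, records):
    """Write one CSV row per record: the ticker, then the record as a JSON string."""
    # The default csv dialect quotes any field containing commas or double quotes
    writer = csv.writer(dest)
    for record in records:
        writer.writerow([ticker, json.dumps(record)])


# Hypothetical sample standing in for one element of the API response
buffer = io.StringIO()
write_ticker_rows(buffer, "AAPL", [{"symbol": "AAPL", "price": 144.49}])
print(buffer.getvalue())

# Read it back to confirm each row has exactly two columns
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

When writing to a real file, open it with newline='' as the csv module documentation recommends. On the PostgreSQL side the file then has to be loaded in CSV mode so the quotes are honoured, for example: copy fmp.t_profile_json(ticker,profile) from 'C:\DATA\fmp_profile.csv' with (format csv);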

Python Lambda actioning CSV file once but not the second time

I am experiencing a strange issue with my Python code. Its objective is the following:
Retrieve a .csv from S3
Convert that .csv into JSON (it's an array of objects)
Add a few key-value pairs to each object in the array, and change the original key values
Validate the JSON
Send the JSON to an /output S3 bucket
Load the JSON into Dynamo
Here's what the .csv looks like:
Prefix,Provider
ABCDE,Provider A
QWERT,Provider B
ASDFG,Provider C
ZXCVB,Provider D
POIUY,Provider E
And here's my python script:
import json
import boto3
import ast
import csv
import os
import datetime as dt
from datetime import datetime
import jsonschema
from jsonschema import validate
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
providerCodesSchema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "providerCode": {"type": "string", "maxLength": 5},
            "providerName": {"type": "string"},
            "activeFrom": {"type": "string", "format": "date"},
            "activeTo": {"type": "string"},
            "apiActiveFrom": {"type": "string"},
            "apiActiveTo": {"type": "string"},
            "countThreshold": {"type": "string"}
        },
        "required": ["providerCode", "providerName"]
    }
}
datestamp = dt.datetime.now().strftime("%Y/%m/%d")
timestamp = dt.datetime.now().strftime("%s")
updateTime = dt.datetime.now().strftime("%Y/%m/%d/%H:%M:%S")
nowdatetime = dt.datetime.now()
yesterday = nowdatetime - dt.timedelta(days=1)
nintydaysfromnow = nowdatetime + dt.timedelta(days=90)
def lambda_handler(event, context):
    filename_json = "/tmp/file_{ts}.json".format(ts=timestamp)
    filename_csv = "/tmp/file_{ts}.csv".format(ts=timestamp)
    keyname_s3 = "newloader-ptv/output/{ds}/{ts}.json".format(ds=datestamp, ts=timestamp)
    json_data = []
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        key_name = record['s3']['object']['key']
        s3_object = s3.get_object(Bucket=bucket_name, Key=key_name)
        data = s3_object['Body'].read()
        contents = data.decode('latin')
        with open(filename_csv, 'a', encoding='utf-8') as csv_data:
            csv_data.write(contents)
        with open(filename_csv, encoding='utf-8-sig') as csv_data:
            csv_reader = csv.DictReader(csv_data)
            for csv_row in csv_reader:
                json_data.append(csv_row)
    for elem in json_data:
        elem['providerCode'] = elem.pop('Prefix')
        elem['providerName'] = elem.pop('Provider')
    for element in json_data:
        element['activeFrom'] = yesterday.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
        element['activeTo'] = nintydaysfromnow.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
        element['apiActiveFrom'] = " "
        element['apiActiveTo'] = " "
        element['countThreshold'] = "3"
        element['updateDate'] = updateTime
    try:
        validate(instance=json_data, schema=providerCodesSchema)
    except jsonschema.exceptions.ValidationError as err:
        print(err)
        err = "Given JSON data is InValid"
        return None
    with open(filename_json, 'w', encoding='utf-8-sig') as json_file:
        json_file.write(json.dumps(json_data, default=str))
    with open(filename_json, 'r', encoding='utf-8-sig') as json_file_contents:
        response = s3.put_object(Bucket=bucket_name, Key=keyname_s3, Body=json_file_contents.read())
    for jsonElement in json_data:
        table = dynamodb.Table('privateProviders-loader')
        table.put_item(Item=jsonElement)
    print("finished enriching JSON")
    os.remove(filename_csv)
    os.remove(filename_json)
    return None
I'm new to Python, so please forgive any amateur mistakes in the code.
Here's my issue:
When I deploy the code, and add a valid .csv into my S3 bucket, everything works.
When I then add an invalid .csv into my S3 bucket, it again works: the import fails because the validation kicks in and tells me the problem.
However, when I add the valid .csv back into the S3 bucket, I get the same cloudwatch log as I did for the invalid .csv, and my Dynamo isn't updated, nor is an output JSON file sent to /output in S3.
With some troubleshooting I've noticed the following behaviour:
When I first deploy the code, the first .csv loads as expected (dynamo table updated + JSON file sent to S3 + cloudwatch logs documenting the process)
If I enter the same valid .csv into the S3 bucket, it gives me the same nice looking cloudwatch logs, but none of the other actions take place (Dynamo not updated etc)
If I add the invalid .csv, that seems to break the cycle, and I get a nice Cloudwatch log showing the validation has kicked in, but if I reload the valid .csv, which just previously resulted in good cloudwatch logs (but no actual real outputs), I now get a repeat of the validation error log.
In short, the first time the function is invoked, it seems to work, the second time it doesn't.
It seems as though the Python function is caching something or not closing out the function when finished. I've played about with the return command etc., but nothing I've tried works. I've sunk many hours into moving parts of the code around, thinking the structure or order of events is the problem, and the code above gives me the closest behaviour to what I expect, given that it works completely the first and only time I load the .csv into S3.
Any help or general pointers would be massively appreciated.
Thanks
P.S. Here's an example of the CloudWatch log when validation kicks in and stops an invalid .csv from being processed. If I then add a valid .csv to S3, the function is triggered, but I get this same error, even though the file is actually good.
2021-06-29T22:12:27.709+01:00 'ABCDEE' is too long
2021-06-29T22:12:27.709+01:00 Failed validating 'maxLength' in schema['items']['properties']['providerCode']:
2021-06-29T22:12:27.709+01:00 {'maxLength': 5, 'type': 'string'}
2021-06-29T22:12:27.709+01:00 On instance[2]['providerCode']:
2021-06-29T22:12:27.709+01:00 'ABCDEE'
2021-06-29T22:12:27.710+01:00 END RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9
2021-06-29T22:12:27.710+01:00 REPORT RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9 Duration: 482.43 ms Billed Duration: 483
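One plausible explanation, not confirmed anywhere in this thread: Lambda can reuse the same execution environment for later invocations, so module-level values such as timestamp are computed only once and files in /tmp survive between calls. The handler appends to a file whose name never changes and returns before os.remove when validation fails, so the invalid rows would still be in the file on the next run. The frozen timestamp would also mean the same S3 key is overwritten each time. The sketch below simulates that locally without any AWS calls; both handler names and the use of a temporary directory are assumptions for illustration.

```python
import os
import tempfile
import time
import uuid

# Evaluated once at import time, like the module-level `timestamp` in the question
MODULE_TIMESTAMP = str(time.time_ns())


def handler_with_stale_name(contents: str, tmp_dir: str) -> str:
    """Mimics the question: fixed file name, append mode, no cleanup on early exit."""
    path = os.path.join(tmp_dir, "file_{ts}.csv".format(ts=MODULE_TIMESTAMP))
    with open(path, "a", encoding="utf-8") as f:
        f.write(contents)
    with open(path, encoding="utf-8") as f:
        return f.read()


def handler_with_fresh_name(contents: str, tmp_dir: str) -> str:
    """Per-invocation file name, write mode, and cleanup even if something fails."""
    path = os.path.join(tmp_dir, "file_{id}.csv".format(id=uuid.uuid4().hex))
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(contents)
        with open(path, encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


tmp_dir = tempfile.mkdtemp()
invalid = "Prefix,Provider\nABCDEE,Provider C\n"
valid = "Prefix,Provider\nABCDE,Provider A\n"

handler_with_stale_name(invalid, tmp_dir)        # first upload: invalid file
stale = handler_with_stale_name(valid, tmp_dir)  # second upload: valid file, same environment
fresh = handler_with_fresh_name(valid, tmp_dir)

print("ABCDEE" in stale)  # the invalid row is still present
print(fresh == valid)     # only the new file's contents are read
```

If this is what is happening, moving the timestamp, datestamp and file name calculations inside lambda_handler, opening the CSV with 'w' instead of 'a', and cleaning up in a finally block would be the things to try first.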

How do you print multiple key values from sub keys in a .json file?

I'm pulling a list of AMI IDs from my AWS account and it's being written into a JSON file.
The json looks basically like this:
{
"Images": [
{
"CreationDate": "2017-11-24T11:05:32.000Z",
"ImageId": "ami-XXXXXXXX"
},
{
"CreationDate": "2017-11-24T11:05:32.000Z",
"ImageId": "ami-aaaaaaaa"
},
{
"CreationDate": "2017-10-24T11:05:32.000Z",
"ImageId": "ami-bbbbbbb"
},
{
"CreationDate": "2017-10-24T11:05:32.000Z",
"ImageId": "ami-cccccccc"
},
{
"CreationDate": "2017-12-24T11:05:32.000Z",
"ImageId": "ami-ddddddd"
},
{
"CreationDate": "2017-12-24T11:05:32.000Z",
"ImageId": "ami-eeeeeeee"
}
]
}
My code looks like this so far after gathering the info and writing it to a .json file locally:
# writes json output to file...
print('writing to response.json...')
with open('response.json', 'w') as outfile:
    json.dump(response, outfile, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ': '))
# searches file...
print('opening response.json...')
with open("response.json") as f:
    file_parsed = json.load(f)
The next part I'm stuck on is how to iterate through the file and print only the CreationDate and ImageId values.
print('printing CreationDate and ImageId...')
for ami in file_parsed['Images']:
    #print ami['CreationDate'] #THIS WORKS
    #print ami['ImageId'] #THIS WORKS
    #print ami['CreationDate']['ImageId']
The last line there gives me this no matter how I have tried it: TypeError: string indices must be integers
My desired output is something like this:
2017-11-24T11:05:32.000Z ami-XXXXXXXX
Ultimately what I'm looking to do is iterate through the entries that are a certain date or older and deregister those AMIs. So would I be converting these to a list or a dict?
Pretty much not a programmer here, so don't drown me.
TIA

You have almost parsed the JSON; for the desired output you just need to concatenate the 'CreationDate' and 'ImageId' values like this:
for ami in file_parsed['Images']:
    print(ami['CreationDate'] + " " + ami['ImageId'])

CreationDate evaluates to a string, and a string can only be indexed with integers, which is why ami['CreationDate']['ImageId'] raises a TypeError. Your other two commented lines, however, were correct.
To check if the date is older, you can make use of the datetime module. For instance, you can take the CreationDate (which is a string), convert it to a datetime object, create your own datetime for the cutoff date, and compare the two.
Something to this effect:
from datetime import datetime
def checkIfOlder(isoformat, targetDate):
    # Parse the ISO 8601 string into a datetime so the two values can be compared
    creationDate = datetime.strptime(isoformat, '%Y-%m-%dT%H:%M:%S.%fZ')
    return creationDate <= targetDate
certainDate = datetime(2017, 11, 30) # Or whichever date you want
So in your for loop:
for ami in file_parsed['Images']:
    creationDate = ami['CreationDate']
    if checkIfOlder(creationDate, certainDate):
        pass # write code to deregister AMIs here
Useful resources are Python's datetime documentation and, in particular, the strftime/strptime directives. HTH!
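Putting the pieces above together, here is a small self-contained sketch. It uses a trimmed copy of the JSON from the question and a made-up cutoff date, and the function name images_created_on_or_before is hypothetical. On the list-or-dict question: file_parsed['Images'] is already a list of dicts, so no conversion is needed.

```python
import json
from datetime import datetime

# Trimmed copy of the structure shown in the question
response_text = """
{"Images": [
  {"CreationDate": "2017-11-24T11:05:32.000Z", "ImageId": "ami-XXXXXXXX"},
  {"CreationDate": "2017-10-24T11:05:32.000Z", "ImageId": "ami-cccccccc"},
  {"CreationDate": "2017-12-24T11:05:32.000Z", "ImageId": "ami-eeeeeeee"}
]}
"""


def images_created_on_or_before(images, cutoff):
    """Return the image dicts whose CreationDate is on or before the cutoff."""
    return [
        image for image in images
        if datetime.strptime(image["CreationDate"], "%Y-%m-%dT%H:%M:%S.%fZ") <= cutoff
    ]


file_parsed = json.loads(response_text)
cutoff = datetime(2017, 11, 30)  # hypothetical cutoff date
old_images = images_created_on_or_before(file_parsed["Images"], cutoff)

for image in old_images:
    # Prints the creation date and the AMI ID separated by a space
    print(image["CreationDate"], image["ImageId"])
```

The deregistration call itself would go inside that final loop.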

Livy Server: return a dataframe as JSON?

I am executing a statement in Livy Server using an HTTP POST call to localhost:8998/sessions/0/statements, with the following body:
{
"code": "spark.sql(\"select * from test_table limit 10\")"
}
I would like an answer in the following format
(...)
"data": {
"application/json": "[
{"id": "123", "init_date": 1481649345, ...},
{"id": "133", "init_date": 1481649333, ...},
{"id": "155", "init_date": 1481642153, ...},
]"
}
(...)
but what I'm getting is
(...)
"data": {
"text/plain": "res0: org.apache.spark.sql.DataFrame = [id: string, init_date: timestamp ... 64 more fields]"
}
(...)
Which is the toString() version of the dataframe.
Is there some way to return a dataframe as JSON using the Livy Server?
EDIT
Found a JIRA issue that addresses the problem: https://issues.cloudera.org/browse/LIVY-72
Judging by the comments, it seems that Livy does not and will not support such a feature?

I recommend using the built-in (albeit hard to find documentation for) magics %json and %table:
%json
import json
import textwrap
import requests
# host and headers are assumed to be defined elsewhere
session_url = host + "/sessions/1"
statements_url = session_url + '/statements'
data = {
    'code': textwrap.dedent("""\
        val d = spark.sql("SELECT COUNT(DISTINCT food_item) FROM food_item_tbl")
        val e = d.collect
        %json e
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print(r.json())
%table
session_url = host + "/sessions/21"
statements_url = session_url + '/statements'
data = {
    'code': textwrap.dedent("""\
        val x = List((1, "a", 0.12), (3, "b", 0.63))
        %table x
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print(r.json())
Related: Apache Livy: query Spark SQL via REST: possible?

I don't have a lot of experience with Livy, but as far as I know this endpoint is used as an interactive shell, and the output will be a string with the actual result that would be shown by a shell. So, with that in mind, I can think of a way to emulate the result you want, but it may not be the best way to do it:
{
"code": "println(spark.sql(\"select * from test_table limit 10\").toJSON.collect.mkString(\"[\", \",\", \"]\"))"
}
Then, you will have a JSON wrapped in a string, so your client could parse it.
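On the client side, unwrapping that string could look roughly like the sketch below. It assumes the statement result carries an "output" object with "status" and "data" fields and that "text/plain" holds nothing but the printed JSON array; the sample response is hypothetical, modelled on the excerpts in the question, and the function name rows_from_statement is made up.

```python
import json


def rows_from_statement(statement: dict) -> list:
    """Parse the JSON array printed by the statement out of its text/plain output."""
    output = statement["output"]
    if output.get("status") != "ok":
        raise RuntimeError("statement did not succeed: {}".format(output))
    # json.loads tolerates the trailing newline that println adds
    return json.loads(output["data"]["text/plain"])


# Hypothetical response body; real responses may contain more fields
sample_statement = {
    "id": 0,
    "state": "available",
    "output": {
        "status": "ok",
        "execution_count": 0,
        "data": {
            "text/plain": '[{"id":"123","init_date":1481649345},{"id":"133","init_date":1481649333}]\n'
        },
    },
}

rows = rows_from_statement(sample_statement)
print(rows[0]["id"], rows[0]["init_date"])
```

If the interpreter echoes anything else into the same output, the string would need trimming before json.loads.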

I think in general your best bet is to write your output to a database of some kind. If you write to a randomly named table, you could have your code read it after the script is done.

Select (ignore if does not exists) for JSON logs Spark SQL

I am new to Apache Spark and trying out a few POCs around it. I am trying to read JSON logs which are structured, but a few fields are not always guaranteed to be present, for example:
{
"item": "A",
"customerId": 123,
"hasCustomerId": true,
.
.
.
},
{
"item": "B",
"hasCustomerId": false,
.
.
.
}
}
Assume I want to transform these JSON logs into CSV. I was trying out Spark SQL to get hold of all the fields with simple SELECT statements, but as the second JSON object is missing a field (although it does have an identifier), I am not sure how to handle this.
I want to transform the above json logs to
item, customerId, ....
A , 123 , ....
B , null/0 , ....

You should use SQLContext to read the JSON file: sqlContext.read.json("file/path"). But if you want to convert it into CSV first and then read it with missing values, your CSV file should look like
item,customerId,hasCustomerId, ....
A,123,, .... // hasCustomerId is null
B,,888, .... // customerId is null
i.e. with empty fields for the missing values. Then you have to read it like this:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("file/path")
