Python Lambda actioning CSV file once but not the second time - json

I am experiencing a strange issue with my Python code. Its objective is the following:
Retrieve a .csv from S3
Convert that .csv into JSON (it's an array of objects)
Add a few key-value pairs to each object in the array, and rename the original keys
Validate the JSON
Send the JSON to an /output S3 bucket
Load the JSON into Dynamo
Here's what the .csv looks like:
Prefix,Provider
ABCDE,Provider A
QWERT,Provider B
ASDFG,Provider C
ZXCVB,Provider D
POIUY,Provider E
And here's my python script:
import json
import boto3
import ast
import csv
import os
import datetime as dt
from datetime import datetime
import jsonschema
from jsonschema import validate

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

providerCodesSchema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "providerCode": {"type": "string", "maxLength": 5},
            "providerName": {"type": "string"},
            "activeFrom": {"type": "string", "format": "date"},
            "activeTo": {"type": "string"},
            "apiActiveFrom": {"type": "string"},
            "apiActiveTo": {"type": "string"},
            "countThreshold": {"type": "string"}
        },
        "required": ["providerCode", "providerName"]
    }
}

datestamp = dt.datetime.now().strftime("%Y/%m/%d")
timestamp = dt.datetime.now().strftime("%s")
updateTime = dt.datetime.now().strftime("%Y/%m/%d/%H:%M:%S")
nowdatetime = dt.datetime.now()
yesterday = nowdatetime - dt.timedelta(days=1)
nintydaysfromnow = nowdatetime + dt.timedelta(days=90)

def lambda_handler(event, context):
    filename_json = "/tmp/file_{ts}.json".format(ts=timestamp)
    filename_csv = "/tmp/file_{ts}.csv".format(ts=timestamp)
    keyname_s3 = "newloader-ptv/output/{ds}/{ts}.json".format(ds=datestamp, ts=timestamp)
    json_data = []
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        key_name = record['s3']['object']['key']
        s3_object = s3.get_object(Bucket=bucket_name, Key=key_name)
        data = s3_object['Body'].read()
        contents = data.decode('latin')
        with open(filename_csv, 'a', encoding='utf-8') as csv_data:
            csv_data.write(contents)
        with open(filename_csv, encoding='utf-8-sig') as csv_data:
            csv_reader = csv.DictReader(csv_data)
            for csv_row in csv_reader:
                json_data.append(csv_row)
    for elem in json_data:
        elem['providerCode'] = elem.pop('Prefix')
        elem['providerName'] = elem.pop('Provider')
    for element in json_data:
        element['activeFrom'] = yesterday.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
        element['activeTo'] = nintydaysfromnow.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
        element['apiActiveFrom'] = " "
        element['apiActiveTo'] = " "
        element['countThreshold'] = "3"
        element['updateDate'] = updateTime
    try:
        validate(instance=json_data, schema=providerCodesSchema)
    except jsonschema.exceptions.ValidationError as err:
        print(err)
        err = "Given JSON data is InValid"
        return None
    with open(filename_json, 'w', encoding='utf-8-sig') as json_file:
        json_file.write(json.dumps(json_data, default=str))
    with open(filename_json, 'r', encoding='utf-8-sig') as json_file_contents:
        response = s3.put_object(Bucket=bucket_name, Key=keyname_s3, Body=json_file_contents.read())
    for jsonElement in json_data:
        table = dynamodb.Table('privateProviders-loader')
        table.put_item(Item=jsonElement)
    print("finished enriching JSON")
    os.remove(filename_csv)
    os.remove(filename_json)
    return None
I'm new to Python, so please forgive any amateur mistakes in the code.
Here's my issue:
When I deploy the code and add a valid .csv to my S3 bucket, everything works.
When I then add an invalid .csv to my S3 bucket, it works again: the import fails because the validation kicks in and tells me the problem.
However, when I add the valid .csv back into the S3 bucket, I get the same CloudWatch log as I did for the invalid .csv, my Dynamo table isn't updated, and no output JSON file is sent to /output in S3.
With some troubleshooting I've noticed the following behaviour:
When I first deploy the code, the first .csv loads as expected (Dynamo table updated + JSON file sent to S3 + CloudWatch logs documenting the process)
If I add the same valid .csv to the S3 bucket again, it gives me the same nice-looking CloudWatch logs, but none of the other actions take place (Dynamo not updated etc.)
If I add the invalid .csv, that seems to break the cycle, and I get a nice CloudWatch log showing the validation has kicked in. But if I then reload the valid .csv, which previously resulted in good CloudWatch logs (but no actual outputs), I now get a repeat of the validation error log.
In short, the first time the function is invoked it seems to work; the second time it doesn't.
It seems as though the Python function is caching something or not closing out the function when finished. I've played about with the return command etc., but nothing I've tried works. I've sunk many hours into moving parts of the code around, thinking the structure or order of events is the problem, and the code above gives me the closest behaviour to what I expect, given that it works completely the first (and only the first) time I load the .csv into S3.
Any help or general pointers would be massively appreciated.
Thanks
P.S. Here's an example of the CloudWatch log when validation kicks in and stops an invalid .csv from being processed. If I then add a valid .csv to S3, the function is triggered, but I get this same error, even though the file is actually good.
2021-06-29T22:12:27.709+01:00 'ABCDEE' is too long
2021-06-29T22:12:27.709+01:00 Failed validating 'maxLength' in schema['items']['properties']['providerCode']:
2021-06-29T22:12:27.709+01:00 {'maxLength': 5, 'type': 'string'}
2021-06-29T22:12:27.709+01:00 On instance[2]['providerCode']:
2021-06-29T22:12:27.709+01:00 'ABCDEE'
2021-06-29T22:12:27.710+01:00 END RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9
2021-06-29T22:12:27.710+01:00 REPORT RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9 Duration: 482.43 ms Billed Duration: 483
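A possible explanation, offered as an assumption rather than a confirmed diagnosis: Lambda reuses the execution environment between invocations, so module-level values such as timestamp are computed once per cold start, and files left in /tmp survive into later warm invocations. In the script above the CSV is opened in append mode and the early return on a validation failure skips os.remove, so the invalid rows would still be in the file when the next valid upload is appended to it. The reused timestamp would also make every warm invocation write to the same S3 key. A minimal sketch of the idea, with hypothetical helper names (make_csv_path, parse_csv_text) and no AWS calls:

```python
import csv
import os
import tempfile
import time
import uuid


def make_csv_path(tmp_dir):
    """Build a per-invocation file name; call this inside the handler, not at import time."""
    token = "{}_{}".format(int(time.time()), uuid.uuid4().hex)
    return os.path.join(tmp_dir, "file_{}.csv".format(token))


def parse_csv_text(contents, tmp_dir=None):
    """Write CSV text to a fresh temp file, parse it, and always clean up afterwards."""
    tmp_dir = tmp_dir or tempfile.gettempdir()
    filename_csv = make_csv_path(tmp_dir)
    try:
        # 'w' truncates, so nothing from an earlier invocation can leak in
        with open(filename_csv, "w", encoding="utf-8", newline="") as fh:
            fh.write(contents)
        with open(filename_csv, encoding="utf-8-sig", newline="") as fh:
            return list(csv.DictReader(fh))
    finally:
        # runs on success and on failure, unlike a remove placed after an early return
        if os.path.exists(filename_csv):
            os.remove(filename_csv)
```

Calling parse_csv_text with an invalid file and then with a valid one returns only the valid rows the second time. In the real handler the same pattern would mean computing datestamp, timestamp and the file names inside lambda_handler; the temp file could probably be dropped altogether by parsing the decoded text with io.StringIO.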

Related

how to generate schema from a newline delimited JSON file in python

I want to generate a schema from a newline-delimited JSON file in which each row has a variable set of key/value pairs. File size can vary from 5 MB to 25 MB.
Sample Data:
{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}
Expected Schema:
[
    {"name": "col1", "type": "INTEGER"},
    {"name": "col2", "type": "STRING"},
    {"name": "col3", "type": "FLOAT"},
    {"name": "col4", "type": "DATE"}
]
Notes:
There is no scope to use any external tool, as files are loaded into an inbound location dynamically. The code will be triggered by an event as soon as a file arrives and will perform a schema comparison.
Your first problem is that JSON does not have a date type, so you will get str there.
What I would do if I were you is this:
import json

# Wherever your input comes from
inp = """{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}"""

schema = {}
# Split it at newlines
for line in inp.split('\n'):
    # each line contains a "dict"
    tmp = json.loads(line)
    for key in tmp:
        # if we have not seen the key before, add it
        if key not in schema:
            schema[key] = type(tmp[key])
        # otherwise check the type
        else:
            if schema[key] != type(tmp[key]):
                raise Exception("Schema mismatch")

# format however you like
out = []
for item in schema:
    out.append({"name": item, "type": schema[item].__name__})
print(json.dumps(out, indent=2))
I'm using python types for simplicity, but you can write your own function to get the type, e.g. if you want to check if a string is actually a date.
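As a rough sketch of such a function: the type names below are taken from the expected schema in the question, while infer_type and infer_schema are hypothetical helpers, and treating any string that parses as YYYY-MM-DD as a DATE is an assumption about what counts as a date.

```python
import json
from datetime import datetime


def infer_type(value):
    """Map a decoded JSON value to one of the type names used in the question."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")  # assumed date format
            return "DATE"
        except ValueError:
            return "STRING"
    return "STRING"  # fallback for null, lists and nested objects


def infer_schema(lines):
    """Build a list of {"name", "type"} entries from newline-delimited JSON lines."""
    schema = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        for key, value in json.loads(line).items():
            found = infer_type(value)
            if schema.setdefault(key, found) != found:
                raise ValueError("Schema mismatch for {!r}".format(key))
    return [{"name": name, "type": kind} for name, kind in schema.items()]
```

Passing an open file object works as well as a list of strings, so a 25 MB file can be processed line by line without loading it whole.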

CSV Reader works, Trouble with CSV writer

I am writing a very simple python script to READ a CSV (no problem) and to write to another CSV (issue):
System info:
Windows 10
Powershell
Python 3.6.5 :: Anaconda, Inc.
Sample Data: Office Events
The purpose is to filter events based on criteria, and to write to another CSV with desired criteria.
For Example:
I would like to read from this CSV and write the events where Registrations (or column 4) is Greater than 0 (remove rows with registrations = 0)
# SCRIPT TO FILTER EVENTS TO BE PROCESSED
import os
import time
import shutil
import os.path
import fnmatch
import csv
import glob
import pandas

# Location of file containing ALL events
path = r'allEvents.csv'

# Writes to writer
writer = csv.writer(open(r'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv', "wb"))
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
#writer.writerow([r'Event Name', r'Start Date', r'End Date', r'Registrations', r'Total Revenue', r'ID', r'Status'])
#writer.writerow([b'Event Name', b'Start Date', b'End Date', b'Registrations', b'Total Revenue', b'ID', b'Status'])

def checkRegistrations(file):
    reader = csv.reader(file)
    data = list(reader)
    for row in data:
        #if row[3] > str(0):
        if row[3] > int(0):
            writer.writerow(([data]))
The Error I continue to get is:
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
TypeError: a bytes-like object is required, not 'str'
I have tried using the various commented out statements
For Example:
"" vs r"" vs r'' vs b''
if row[3] > int(0) **vs** if row[3] > str(0)
Every time I execute my script, It creates the file.. so the first csv writer line works (create and open the file)... the second line (to write the headers) is when the error appears...
Perhaps I am getting mixed up with syntax due to python versions, or perhaps I am misusing the CSV library, or (more than likely) I have endless to learn about data type IO and conversion... someone please help!!
I am aware of the excess of import libraries -- script came from another basic script to move files from one location to another based on filename and output a rowcounter for each file being moved.
With that being said, I may be unaware of any missing/ needed libraries
Please let me know if you have any questions, concerns or clarifications
Thanks in advance!
It looks like you are calling:
writer = csv.writer(open('file.csv', 'wb'))
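The 'wb' argument is the file mode. The 'b' opens the file in binary mode, so the file object expects bytes. In Python 3, csv.writer writes str, which is why writing the header row raises the TypeError.
Try getting rid of the 'b' in the 'wb'. The csv documentation also recommends passing newline='' when opening the file, which avoids extra blank rows on Windows:
writer = csv.writer(open('file.csv', 'w', newline=''))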
Let me know if that works for you.
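For the filtering itself, here is a minimal sketch under a few assumptions: the function name filter_registered and its parameters are hypothetical, the first row of the input is a header, and column index 3 holds an integer registration count. Values read by csv.reader are strings, so they are converted with int() before the comparison.

```python
import csv


def filter_registered(src_path, dst_path, column=3):
    """Copy the header plus every row whose value in `column` is greater than 0."""
    kept = 0
    # text mode with newline='' for both files, as the csv module expects in Python 3
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return kept  # empty input: nothing to write
        writer.writerow(header)
        for row in reader:
            if int(row[column]) > 0:  # csv gives strings, so convert before comparing
                writer.writerow(row)  # write the single row, not the whole data list
                kept += 1
    return kept
```

A call such as filter_registered('allEvents.csv', 'RegisteredEvents.csv') would write the filtered file and return the number of rows kept; the with block also makes sure both files are closed.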

How to write all the returned data into JSON file using Python?

How do I write the returned dict into a JSON file?
So far I am able to return the correct data and print it, but when I try to write it to a JSON file, only the last record is written.
I would appreciate any help fixing this.
example :
printed data:
[{"file Name": "test1.txt", "searched Word": "th", "number of occurence": 1}]
[{"file Name": "test2.txt", "searched Word": "th", "number of occurence": 1}]
json file
[{"file Name": "test2.txt", "searched Word": "th", "number of occurence": 1}]
code:
import re
import json
import os.path
import datetime
for counter, myLine in enumerate(textList):
    thematch=re.sub(searchedSTR,RepX,myLine)
    matches = re.findall(searchedSTR, myLine, re.MULTILINE | re.IGNORECASE)
    if len(matches) > 0:
        # add one record for the match (add one because line numbers start with 1)
        d[matches[0]].append(counter + 1)
    self.textEdit_PDFpreview.insertHtml(str(thematch))
'''
loop over the selected file and extract 3 values:
==> name of file
==> searched expression
==> number of occurrences
'''
for match, positions in d.items():
    listMetaData = [{"file Name":fileName,"searched Word":match,"number of occurence":len(positions)}]
    jsondata = json.dumps(listMetaData)
    print("in the for loop ==>jsondata: \n{0}".format(jsondata))
'''
create a folder called 'search_result' that holds all the search results as JSON files:
if the folder exists the system continues, otherwise the system creates the folder
inside the folder the file will be created as ==> today_searchResult.js
'''
if not(os.path.exists("./search_result")):
    try:
        #print(os.mkdir("./search_result"))
        #print(searchResultFoder)
        today = datetime.date.today()
        fileName = "{}_searchResult.json".format(today)
        #fpJ = os.path.join(os.mkdir("./search_result"),fileName)
        #print(fpJ)
        with open(fileName,"w") as jsf:
            jsf.write(jsondata)
            print("finish writing")
    except Exception as e:
        print(e)
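One likely cause, stated as an assumption since the surrounding code is not shown: jsondata is reassigned for every record, and the output file is reopened with "w" on every run, so each write replaces the previous one. A minimal sketch of collecting all records first and writing them once follows. The helper names build_records and write_records are hypothetical, and results_by_file stands in for whatever structure holds the match positions per file.

```python
import json


def build_records(results_by_file):
    """Flatten {file name: {word: [line numbers]}} into one list of records."""
    records = []
    for file_name, matches in results_by_file.items():
        for word, positions in matches.items():
            records.append({
                "file Name": file_name,
                "searched Word": word,
                "number of occurence": len(positions),  # key spelling kept from the question
            })
    return records


def write_records(records, out_path):
    """Write the whole list in a single call so earlier records are not overwritten."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)
```

If the files are processed in separate runs instead, the existing JSON file would need to be loaded, extended and written back, because opening with "w" always truncates it.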

How do you print multiple key values from sub keys in a .json file?

I'm pulling a list of AMI IDs from my AWS account and it's being written into a JSON file.
The JSON looks basically like this:
{
    "Images": [
        {
            "CreationDate": "2017-11-24T11:05:32.000Z",
            "ImageId": "ami-XXXXXXXX"
        },
        {
            "CreationDate": "2017-11-24T11:05:32.000Z",
            "ImageId": "ami-aaaaaaaa"
        },
        {
            "CreationDate": "2017-10-24T11:05:32.000Z",
            "ImageId": "ami-bbbbbbb"
        },
        {
            "CreationDate": "2017-10-24T11:05:32.000Z",
            "ImageId": "ami-cccccccc"
        },
        {
            "CreationDate": "2017-12-24T11:05:32.000Z",
            "ImageId": "ami-ddddddd"
        },
        {
            "CreationDate": "2017-12-24T11:05:32.000Z",
            "ImageId": "ami-eeeeeeee"
        }
    ]
}
My code looks like this so far after gathering the info and writing it to a .json file locally:
#writes json output to file...
print('writing to response.json...')
with open('response.json', 'w') as outfile:
    json.dump(response, outfile, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ': '))
#Searches file...
print('opening response.json...')
with open("response.json") as f:
    file_parsed = json.load(f)
The next part I'm stuck on is how to iterate through the file and print only the CreationDate and ImageId values.
print('printing CreationDate and ImageId...')
for ami in file_parsed['Images']:
    #print ami['CreationDate'] #THIS WORKS
    #print ami['ImageId'] #THIS WORKS
    #print ami['CreationDate']['ImageId']
The last line there gives me this no matter how I have tried it: TypeError: string indices must be integers
My desired output is something like this:
2017-11-24T11:05:32.000Z ami-XXXXXXXX
Ultimately what I'm looking to do is iterate through the entries that are a certain date or older and deregister those AMIs. So would I be converting these to a list or a dict?
Pretty much not a programmer here, so don't drown me.
TIA
You have almost parsed the JSON; for the desired output you need to concatenate the 'CreationDate' and 'ImageId' values like this:
for ami in file_parsed['Images']:
    print(ami['CreationDate'] + " " + ami['ImageId'])
ami['CreationDate'] evaluates to a string, and a string can only be indexed with integers, which is why ami['CreationDate']['ImageId'] raises a TypeError. Your other two commented lines, however, were correct.
To check if the date is older, you can make use of the datetime module. For instance, you can take the CreationDate (which is a string), convert it to a datetime object, create your own based on what that certain date is, and compare the two.
Something to this effect:
from datetime import datetime

def checkIfOlder(isoformat, targetDate):
    # parse the CreationDate string into a datetime object before comparing
    parsedDate = datetime.strptime(isoformat, '%Y-%m-%dT%H:%M:%S.%fZ')
    return parsedDate <= targetDate

certainDate = datetime(2017, 11, 30) # Or whichever date you want
So in your for loop:
for ami in file_parsed['Images']:
    creationDate = ami['CreationDate']
    if checkIfOlder(creationDate, certainDate):
        pass # write code to deregister AMIs here
Useful resources are Python's datetime documentation and, in particular, the strftime/strptime directives. HTH!

Livy Server: return a dataframe as JSON?

I am executing a statement in Livy Server using an HTTP POST call to localhost:8998/sessions/0/statements, with the following body
{
    "code": "spark.sql(\"select * from test_table limit 10\")"
}
I would like an answer in the following format
(...)
"data": {
    "application/json": "[
        {"id": "123", "init_date": 1481649345, ...},
        {"id": "133", "init_date": 1481649333, ...},
        {"id": "155", "init_date": 1481642153, ...},
    ]"
}
(...)
but what I'm getting is
(...)
"data": {
"text/plain": "res0: org.apache.spark.sql.DataFrame = [id: string, init_date: timestamp ... 64 more fields]"
}
(...)
Which is the toString() version of the dataframe.
Is there some way to return a dataframe as JSON using the Livy Server?
EDIT
Found a JIRA issue that addresses the problem: https://issues.cloudera.org/browse/LIVY-72
Judging by the comments, does Livy not support this feature, and will it never support it?
I recommend using the built-in (albeit hard to find documentation for) magics %json and %table:
%json
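import json
import textwrap

import requests

# host and headers are not defined in the original snippet; these values are assumptions
host = 'http://localhost:8998'
headers = {'Content-Type': 'application/json'}

session_url = host + "/sessions/1"
statements_url = session_url + '/statements'
data = {
    'code': textwrap.dedent("""\
        val d = spark.sql("SELECT COUNT(DISTINCT food_item) FROM food_item_tbl")
        val e = d.collect
        %json e
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print(r.json())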
%table
session_url = host + "/sessions/21"
statements_url = session_url + '/statements'
data = {
    'code': textwrap.dedent("""\
        val x = List((1, "a", 0.12), (3, "b", 0.63))
        %table x
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print(r.json())
Related: Apache Livy: query Spark SQL via REST: possible?
I don't have a lot of experience with Livy, but as far as I know this endpoint is used as an interactive shell, and the output will be a string containing the result that a shell would show. With that in mind, I can think of a way to emulate the result you want, though it may not be the best way to do it:
{
    "code": "println(spark.sql(\"select * from test_table limit 10\").toJSON.collect.mkString(\"[\", \",\", \"]\"))"
}
Then, you will have a JSON wrapped in a string, so your client could parse it.
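On the client side, the parsing step might look like the sketch below. It assumes the statement result has the usual Livy shape, with output.status and output.data["text/plain"], and that the printed text is exactly the JSON array produced by the println call; rows_from_statement is a hypothetical helper, not part of any Livy client.

```python
import json


def rows_from_statement(statement):
    """Extract the JSON array printed by the statement from a Livy statement result."""
    output = statement.get("output") or {}
    if output.get("status") != "ok":
        raise RuntimeError("statement did not succeed: {!r}".format(output))
    text = output["data"]["text/plain"]
    # the printed text is a JSON document wrapped in a string, so decode it once more
    return json.loads(text)


# Hand-written example response, for illustration only
example = {
    "id": 0,
    "state": "available",
    "output": {
        "status": "ok",
        "data": {"text/plain": '[{"id":"123","init_date":1481649345}]\n'},
    },
}
rows = rows_from_statement(example)
```

If the shell prints anything besides the array, for instance warnings, json.loads will raise a ValueError, and the text would need to be trimmed first.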
I think in general your best bet is to write your output to a database of some kind. If you write to a randomly named table, you could have your code read it after the script is done.