how to generate schema from a newline delimited JSON file in python - json

I want to generate a schema from a newline-delimited JSON file, where each row in the file has a variable set of key/value pairs. File size can vary from 5 MB to 25 MB.
Sample Data:
{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}
Expected Schema:
[
{"name": "col1", "type": "INTEGER"},
{"name": "col2", "type": "STRING"},
{"name": "col3", "type": "FLOAT"},
{"name": "col4", "type": "DATE"}
]
Notes:
There is no scope to use any external tool, as files are loaded into an inbound location dynamically. The code will be triggered by an event as soon as a file arrives and will perform a schema comparison.

Your first problem is that JSON does not have a date type, so you will get str there.
What I would do, if I were you, is this:
import json

# Wherever your input comes from
inp = """{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}"""

schema = {}
# Split it at newlines
for line in inp.split('\n'):
    # each line contains a "dict"
    tmp = json.loads(line)
    for key in tmp:
        # if we have not seen the key before, add it
        if key not in schema:
            schema[key] = type(tmp[key])
        # otherwise check the type
        else:
            if schema[key] != type(tmp[key]):
                raise Exception("Schema mismatch")

# format however you like
out = []
for item in schema:
    out.append({"name": item, "type": schema[item].__name__})
print(json.dumps(out, indent=2))
I'm using Python types for simplicity, but you can write your own function to get the type, e.g. if you want to check whether a string is actually a date.
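For instance, here is a minimal sketch of such a type function, mapping parsed JSON values to the INTEGER/STRING/FLOAT/DATE names from the expected schema. The detect_type name and the ISO-date check are illustrative assumptions, not part of the original code:

from datetime import datetime

def detect_type(value):
    # Map a parsed JSON value to a schema type name.
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        # Treat strings that parse as ISO dates (YYYY-MM-DD) as DATE.
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "DATE"
        except ValueError:
            return "STRING"
    return "STRING"

You would then store detect_type(tmp[key]) in schema[key] instead of type(tmp[key]), and drop the __name__ lookup when building out.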

Related

Python Lambda actioning CSV file once but not the second time

I am experiencing a strange issue with my Python code. Its objective is the following:
Retrieve a .csv from S3
Convert that .csv into JSON (it's an array of objects)
Add a few key value pairs to each object in the Array, and change the original key values
Validate the JSON
Send the JSON to an /output S3 location
Load the JSON into Dynamo
Here's what the .csv looks like:
Prefix,Provider
ABCDE,Provider A
QWERT,Provider B
ASDFG,Provider C
ZXCVB,Provider D
POIUY,Provider E
And here's my python script:
import json
import boto3
import ast
import csv
import os
import datetime as dt
from datetime import datetime
import jsonschema
from jsonschema import validate

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

providerCodesSchema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "providerCode": {"type": "string", "maxLength": 5},
            "providerName": {"type": "string"},
            "activeFrom": {"type": "string", "format": "date"},
            "activeTo": {"type": "string"},
            "apiActiveFrom": {"type": "string"},
            "apiActiveTo": {"type": "string"},
            "countThreshold": {"type": "string"}
        },
        "required": ["providerCode", "providerName"]
    }
}

datestamp = dt.datetime.now().strftime("%Y/%m/%d")
timestamp = dt.datetime.now().strftime("%s")
updateTime = dt.datetime.now().strftime("%Y/%m/%d/%H:%M:%S")
nowdatetime = dt.datetime.now()
yesterday = nowdatetime - dt.timedelta(days=1)
nintydaysfromnow = nowdatetime + dt.timedelta(days=90)

def lambda_handler(event, context):
    filename_json = "/tmp/file_{ts}.json".format(ts=timestamp)
    filename_csv = "/tmp/file_{ts}.csv".format(ts=timestamp)
    keyname_s3 = "newloader-ptv/output/{ds}/{ts}.json".format(ds=datestamp, ts=timestamp)
    json_data = []
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        key_name = record['s3']['object']['key']
        s3_object = s3.get_object(Bucket=bucket_name, Key=key_name)
        data = s3_object['Body'].read()
        contents = data.decode('latin')
        with open(filename_csv, 'a', encoding='utf-8') as csv_data:
            csv_data.write(contents)
        with open(filename_csv, encoding='utf-8-sig') as csv_data:
            csv_reader = csv.DictReader(csv_data)
            for csv_row in csv_reader:
                json_data.append(csv_row)
        for elem in json_data:
            elem['providerCode'] = elem.pop('Prefix')
            elem['providerName'] = elem.pop('Provider')
        for element in json_data:
            element['activeFrom'] = yesterday.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
            element['activeTo'] = nintydaysfromnow.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
            element['apiActiveFrom'] = " "
            element['apiActiveTo'] = " "
            element['countThreshold'] = "3"
            element['updateDate'] = updateTime
        try:
            validate(instance=json_data, schema=providerCodesSchema)
        except jsonschema.exceptions.ValidationError as err:
            print(err)
            err = "Given JSON data is InValid"
            return None
        with open(filename_json, 'w', encoding='utf-8-sig') as json_file:
            json_file.write(json.dumps(json_data, default=str))
        with open(filename_json, 'r', encoding='utf-8-sig') as json_file_contents:
            response = s3.put_object(Bucket=bucket_name, Key=keyname_s3, Body=json_file_contents.read())
        for jsonElement in json_data:
            table = dynamodb.Table('privateProviders-loader')
            table.put_item(Item=jsonElement)
        print("finished enriching JSON")
        os.remove(filename_csv)
        os.remove(filename_json)
    return None
I'm new to Python, so please forgive any amateur mistakes in the code.
Here's my issue:
When I deploy the code, and add a valid .csv into my S3 bucket, everything works.
When I then add an invalid .csv into my S3 bucket, it again works as intended: the import fails, as the validation kicks in and tells me the problem.
However, when I add the valid .csv back into the S3 bucket, I get the same cloudwatch log as I did for the invalid .csv, and my Dynamo isn't updated, nor is an output JSON file sent to /output in S3.
With some troubleshooting I've noticed the following behaviour:
When I first deploy the code, the first .csv loads as expected (dynamo table updated + JSON file sent to S3 + cloudwatch logs documenting the process)
If I enter the same valid .csv into the S3 bucket, it gives me the same nice looking cloudwatch logs, but none of the other actions take place (Dynamo not updated etc)
If I add the invalid .csv, that seems to break the cycle, and I get a nice Cloudwatch log showing the validation has kicked in, but if I reload the valid .csv, which just previously resulted in good cloudwatch logs (but no actual real outputs), I now get a repeat of the validation error log.
In short, the first time the function is invoked, it seems to work, the second time it doesn't.
It seems as though the Python function is caching something or not closing out when finished, and I've played about with the return command etc., but nothing I've tried works. I've sunk many hours into moving parts of the code around, thinking the structure or order of events is the problem, and the code above gives me the behaviour closest to what's expected, given that it seems to work completely the first and only time I load the .csv into S3.
Any help or general pointers would be massively appreciated.
Thanks
P.s. Here's an example of the CloudWatch log when validation kicks in and stops an invalid .csv from being processed. If I then add a valid .csv to S3, the function is triggered, but I get this same error, even though the file is actually good.
2021-06-29T22:12:27.709+01:00 'ABCDEE' is too long
2021-06-29T22:12:27.709+01:00 Failed validating 'maxLength' in schema['items']['properties']['providerCode']:
2021-06-29T22:12:27.709+01:00 {'maxLength': 5, 'type': 'string'}
2021-06-29T22:12:27.709+01:00 On instance[2]['providerCode']:
2021-06-29T22:12:27.709+01:00 'ABCDEE'
2021-06-29T22:12:27.710+01:00 END RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9
2021-06-29T22:12:27.710+01:00 REPORT RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9 Duration: 482.43 ms Billed Duration: 483

bigrquery bq_table_load csv file with tab delimiter

I am trying to use bigrquery's bq_table_load() command to move a tab-delimited csv file from Google Storage to BigQuery. It works, but it doesn't automatically recognize the column names. Doing the same thing interactively (i.e. in the BigQuery cloud console) works well. Comparing the job metadata for the two jobs (the R-induced job vs the cloud console job), I note that the column delimiter is not set to TAB for the R job. This is despite me including this in my command call, e.g. as follows:
bq_table_load(<x>,<uri>, fieldDelimiter="Tab", source_format = "CSV", autodetect=TRUE)
I tried all sorts of variations of this...nothing seems to work (i.e. the R job will always have the Comma delimiter set)...here are some of the variations I tried:
bq_table_load(<x>,<uri>, field_delimiter="Tab", source_format = "CSV", autodetect=TRUE)
bq_table_load(<x>,<uri>, field_delimiter="\t", source_format = "CSV", autodetect=TRUE)
bq_table_load(<x>,<uri>, field_delimiter="tab", source_format = "CSV", autodetect=TRUE)
Any suggestions?
You can define the schema using a schema file; a sample is given below.
Sample bq load command, where $schema_dir/$TABLENAME.json represents the schema file:
bq --nosync load --source_format=CSV --skip_leading_rows=3 --allow_jagged_rows=TRUE --max_bad_records=10000 \
--allow_quoted_newlines=TRUE $projectid:$dataset.$TABLENAME \
$csv_data_path/$FILENAME $schema_dir/$TABLENAME.json
Sample schema file:
[
    {
        "mode": "NULLABLE",
        "name": "C1",
        "type": "STRING"
    }
]
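As a side note on the original delimiter problem: the bq command-line tool also accepts a field delimiter flag for load jobs, so a tab-delimited file could be loaded with something like the following (a hedged sketch; check bq help load for the exact flag name in your installed version):

bq --nosync load --source_format=CSV --field_delimiter='\t' \
    $projectid:$dataset.$TABLENAME \
    $csv_data_path/$FILENAME $schema_dir/$TABLENAME.json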

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json in order to get a DataFrame as follows:
ddf.compute()
id owner pet_id
0 1 "Charlie" "pet_1"
1 2 "Charlie" "pet_2"
3 4 "Buddy" "pet_3"
but I'm getting the following error
_meta = pd.DataFrame(
    columns=list(["id", "owner", "pet_id"])
).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})
ddf = dd.read_json(f"mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files look like this:
[
    {
        "id": 1,
        "owner": "Charlie",
        "pet_id": "pet_1"
    },
    {
        "id": 2,
        "owner": "Charlie",
        "pet_id": "pet_2"
    }
]
As far as I understand, the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it in the meta= argument.
P.S.:
I also tried doing it in the following way
{
    "id": [1, 2],
    "owner": ["Charlie", "Charlie"],
    "pet_id": ["pet_1", "pet_2"]
}
But Dask is wrongly interpreting the data
ddf.compute()
id owner pet_id
0 [1, 2] ["Charlie", "Charlie"] ["pet_1", "pet_2"]
1 [4] ["Buddy"] ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
blocksize=None, orient="records",
lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file, which is the more common case for Dask (you are not assuming that a newline character means a new record)
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
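Putting those pieces together, a minimal sketch of the full call (assuming the files live under mypets/*.json as in the question):

import dask.dataframe as dd
import pandas as pd

# Empty frame describing the expected columns and dtypes
meta = pd.DataFrame(columns=["id", "owner", "pet_id"]).astype(
    {"id": int, "owner": "object", "pet_id": "object"}
)

# One whole JSON array per file: blocksize=None, orient="records", lines=False
ddf = dd.read_json(
    "mypets/*.json",
    meta=meta,
    blocksize=None,
    orient="records",
    lines=False,
)
print(ddf.compute())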

How do you print multiple key values from sub keys in a .json file?

I'm pulling a list of AMI IDs from my AWS account and it's being written into a JSON file.
The JSON basically looks like this:
{
    "Images": [
        {
            "CreationDate": "2017-11-24T11:05:32.000Z",
            "ImageId": "ami-XXXXXXXX"
        },
        {
            "CreationDate": "2017-11-24T11:05:32.000Z",
            "ImageId": "ami-aaaaaaaa"
        },
        {
            "CreationDate": "2017-10-24T11:05:32.000Z",
            "ImageId": "ami-bbbbbbb"
        },
        {
            "CreationDate": "2017-10-24T11:05:32.000Z",
            "ImageId": "ami-cccccccc"
        },
        {
            "CreationDate": "2017-12-24T11:05:32.000Z",
            "ImageId": "ami-ddddddd"
        },
        {
            "CreationDate": "2017-12-24T11:05:32.000Z",
            "ImageId": "ami-eeeeeeee"
        }
    ]
}
My code looks like this so far after gathering the info and writing it to a .json file locally:
#writes json output to file...
print('writing to response.json...')
with open('response.json', 'w') as outfile:
    json.dump(response, outfile, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ': '))

#Searches file...
print('opening response.json...')
with open("response.json") as f:
    file_parsed = json.load(f)
The next part I'm stuck on is how to iterate through the file and print only the CreationDate and ImageId values.
print('printing CreationDate and ImageId...')
for ami in file_parsed['Images']:
    #print ami['CreationDate'] #THIS WORKS
    #print ami['ImageId'] #THIS WORKS
    #print ami['CreationDate']['ImageId']
The last line there gives me this no matter how I have tried it: TypeError: string indices must be integers
My desired output is something like this:
2017-11-24T11:05:32.000Z ami-XXXXXXXX
Ultimately, what I'm looking to do is iterate through the entries that are a certain date or older and deregister those AMIs. So would I be converting these to a list or a dict?
Pretty much not a programmer here, so don't drown me.
TIA
You have almost parsed the JSON; for the desired output you just need to concatenate the 'CreationDate' and 'ImageId' values like this:
for ami in file_parsed['Images']:
    print(ami['CreationDate'] + " " + ami['ImageId'])
ami['CreationDate'] evaluates to a string, and you can only use integer indices on a string, which is why ['CreationDate']['ImageId'] leads to a TypeError. Your other two commented lines, however, were correct.
To check if the date is older, you can make use of the datetime module. For instance, you can take the CreationDate (which is a string), convert it to a datetime object, create your own based on what that certain date is, and compare the two.
Something to this effect:
def checkIfOlder(isoformat, targetDate):
    dateAsString = datetime.strptime(isoformat, '%Y-%m-%dT%H:%M:%S.%fZ')
    return dateAsString <= targetDate
certainDate = datetime(2017, 11, 30) # Or whichever date you want
So in your for loop:
for ami in file_parsed['Images']:
    creationDate = ami['CreationDate']
    if checkIfOlder(creationDate, certainDate):
        pass # write code to deregister AMIs here
Resources that would help are Python's datetime documentation and, in particular, the strftime/strptime directives. HTH!
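For the deregistration step itself, here is a minimal sketch using boto3's EC2 client; it reuses file_parsed, checkIfOlder and certainDate from above, and the client setup is an assumption for illustration rather than part of the original answer:

import boto3

# Assumes AWS credentials and region are already configured in the environment
ec2 = boto3.client('ec2')

for ami in file_parsed['Images']:
    if checkIfOlder(ami['CreationDate'], certainDate):
        print("deregistering", ami['ImageId'])
        ec2.deregister_image(ImageId=ami['ImageId'])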

How to create a list from json key:values in python3

I'm looking to create a Python 3 list of the locations from the JSON file city.list.json downloaded from OpenWeatherMap: http://bulk.openweathermap.org/sample/city.list.json.gz. The file passes http://json-validator.com/, but I cannot figure out how to correctly open the file and create a list of the values of the key 'name'. I keep hitting json.loads errors about io.TextIOWrapper etc.
I created a short test file
[
    {
        "id": 707860,
        "name": "Hurzuf",
        "country": "UA",
        "coord": {
            "lon": 34.283333,
            "lat": 44.549999
        }
    },
    {
        "id": 519188,
        "name": "Novinki",
        "country": "RU",
        "coord": {
            "lon": 37.666668,
            "lat": 55.683334
        }
    }
]
Is there a way to parse this and create a list ["Hurzuf", "Novinki"] ?
You should use json.load() instead of json.loads(). I named my test file file.json and here is the code:
import json

with open('file.json', mode='r') as f:
    # First, read the JSON file and store its content in a Python variable
    # by using the json.load() function
    json_data = json.load(f)

# So now json_data contains a list of dictionaries
# (because every JSON object is a valid Python dictionary)

# Then we create a result list, in which we will store our names
result_list = []

# We iterate over each dictionary in our list
for json_dict in json_data:
    # We append each name value to our result list
    result_list.append(json_dict['name'])

print(result_list)  # ['Hurzuf', 'Novinki']

# Shorter solution using a list comprehension
result_list = [json_dict['name'] for json_dict in json_data]
print(result_list)  # ['Hurzuf', 'Novinki']
You then just iterate over the elements in your list and pull out the value stored under the 'name' key for each one.
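For completeness, the io.TextIOWrapper error comes from passing the open file object itself to json.loads(), which expects a string; a minimal sketch of the two equivalent ways to read the file:

import json

with open('file.json', mode='r') as f:
    json_data = json.load(f)          # json.load() takes a file object

with open('file.json', mode='r') as f:
    json_data = json.loads(f.read())  # json.loads() takes a string, so read the file first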