Cleaning of JSON Objects using Spark

I have been trying to clean my JSON file. I read the file into an RDD and tried to clean it with the replace function, but I still do not get valid JSON because of the escape sequences present in the JSON values.
Here is the code I am using to clean the file.
The cleaned JSON shows errors. Please review it and tell me what the issue is.
val readjson = sparkSession
  .sparkContext.textFile("dev.json")
val json=readjson.map(element=>element
    .replace("\"\":\"\"","\":\"")
    .replace("\"\",\"\"","\",\"")
    .replace("\"\":","\":")
    .replace(",\"\"",",\"")
    .replace("\"{\"\"","{\"")
    .replace("\"\"}\"","\"}"))
  .saveAsTextFile("JSON")
Here is my JSON file:
"{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":"""",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":"""",""GENDR_CD"":"""",""PROV_NPI"":"""",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":"""",""PROFPROFL_DESGTN"":"""",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"
I tried cleaning the above JSON and got the following result:
{
"SEQ_NO": 596514,
"PROV_DEMOG_SK": 596514,
"PROV_ID": "QMP000003370581",
"FRST_NM": "",
"LAST_NM": "RICHARD WHITTINGTON BUTCHER",
"FUL_NM": "",
"GENDR_CD": "",
"PROV_NPI": "",
"PROV_STAT": "Incomplete",
"PROV_TY": "03",
"DT_OF_BRTH": "",
"PROFPROFL_DESGTN": "",
"ETL_LAST_UPDT_DT_TM": "2020-04-28 11:43:31.000000",
"PROV_CLSFTN_CD": "A",
"SRC_DATA_KEY": 50,
"OPRN_CD": "I",
"REC_SET": "F"
}
Online JSON validators report that it is invalid.

It looks like your JSON contains one or more control characters (\u0009, the tab character). Try replacing them with a space:
.replaceAll("\\u0009"," ")
You can apply the replacements in the following sequence:
val replacedVal = """{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":\"\"\"",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":\"\"\"",""GENDR_CD"":\"\"\"",""PROV_NPI"":\"\"\"",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":\"\"\"",""PROFPROFL_DESGTN"":\"\"\"",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"""
.replace("""\"""",""""""")
.replace("""""""",""""""")
.replaceAll("\\u0009"," ")
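For comparison, the sample line looks like a CSV-quoted field: the whole object is wrapped in double quotes and every inner quote is doubled. Under that assumption (a guess from the single sample shown, not something the question confirms), a CSV parser can undo the quoting in one step instead of a chain of replace calls. A minimal Python sketch, using a shortened copy of the sample as input; the helper name unquote_csv_json is made up for the example:

```python
import csv
import io
import json

# Shortened copy of the sample record: outer quotes plus doubled inner quotes.
raw_line = '"{""SEQ_NO"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":"""",""REC_SET"":""F""}"'

def unquote_csv_json(line):
    """Undo CSV-style quoting on one line and parse the result as JSON."""
    # csv.reader strips the outer quotes and collapses each "" into a single ".
    fields = next(csv.reader(io.StringIO(line)))
    return json.loads(fields[0])

record = unquote_csv_json(raw_line)
print(record["PROV_ID"])
```

If the real file mixes this quoting with other damage (such as the tab characters mentioned above), the remaining replacements would still be needed before parsing.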

Python Regex: How to match the string and then modify that string by adding something at the end

UPDATED CODE: It is working, but now the problem is that the code attaches the same random_value to every Path.
Following is my code with a sample chunk of text. I want to read each Path and its value, then append a slash followed by a unique random combination of letters and digits to the end of every Path value, without changing the existing value. For example, I want the Path to look like
"Path" : "already existed value/1A", etc.
I am unable to work out the exact regex pattern for the replacement.
Any help would be appreciated.
It could be done by parsing the JSON, but the requirement of the task is to do it via regex.
from io import StringIO
import re
import string
import random
reader = StringIO("""{
"Bounds": [
{
"HasClip": true,
"Lang": "no",
"Page": 0,
"Path": "//Document/Sect[2]/Aside/P",
"Text": "Potsdam, den 9. Juni 2021 ",
"TextSize": 12.0
}
],
},
{
"Bounds": [
{
"HasClip": true,
"Lang": "de",
"Page": 0,
"Path": "//Document/Sect[3]/P[4]",
"Text": "this is some text ",
"TextSize": 9.0,
}
],
}""")
def id_generator(size=3, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))
text = reader.read()
random_value = id_generator()
pattern = r'"Path": "(.*?)"'
replacement = '"Path": "\\1/'+random_value+'"'
text = re.sub(pattern, replacement, text)
#This is working but it is only attaching one same random_value on every Path
print(text)
Use group 1 in the replacement:
replacement = '"Path": "\\1/1A"'
See live demo.
The backreference \1 in the replacement string puts back whatever group 1, (.*?), captured in the match.
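A fixed suffix such as /1A still gives every Path the same ending, which is the problem the updated question describes. One way to get a different suffix per match is to pass a function as the replacement: re.sub calls it once for each match. A minimal sketch that reuses id_generator from the question; the helper add_suffix and the shortened sample text are illustrative, not from the original:

```python
import random
import re
import string

def id_generator(size=3, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

def add_suffix(match):
    # Called once per match, so each Path gets its own freshly generated suffix.
    return '"Path": "{}/{}"'.format(match.group(1), id_generator())

text = '''
"Path": "//Document/Sect[2]/Aside/P",
"Path": "//Document/Sect[3]/P[4]",
'''

pattern = r'"Path": "(.*?)"'
result = re.sub(pattern, add_suffix, text)
print(result)
```

The suffixes are random, not guaranteed to be distinct. If strict uniqueness matters, a counter or a set of already-used values would be needed on top of this.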
Since you already have a json structure, maybe it would help to use the json module to parse it.
import json
myDict = json.loads("your json string / variable here")
# now myDict is a dictionary that you can use to loop/read/edit/modify and you can then export myDict as json.
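As a rough illustration of that approach (the question itself requires regex, and its sample text contains trailing commas that json.loads would reject), here is a sketch on a small valid document. The document shape and the counter-based suffix are assumptions made for the example:

```python
import json

doc = json.loads('''
[
  {"Bounds": [{"Page": 0, "Path": "//Document/Sect[2]/Aside/P"}]},
  {"Bounds": [{"Page": 0, "Path": "//Document/Sect[3]/P[4]"}]}
]
''')

# Append a distinct suffix to every Path; a running counter keeps them unique.
counter = 0
for element in doc:
    for bound in element["Bounds"]:
        counter += 1
        bound["Path"] = "{}/{}".format(bound["Path"], counter)

# Export the modified structure as JSON again.
print(json.dumps(doc, indent=2))
```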

Databricks - explode JSON from SQL column with PySpark

New to Databricks. Have a SQL database table that I am creating a dataframe from. One of the columns is a JSON string. I need to explode the nested JSON into multiple columns. Have used this post and this post to get me to where I am at now.
Example JSON:
{
  "Module": {
    "PCBA Serial Number": "G7456789",
    "Manufacturing Designator": "DISNEY",
    "Firmware Version": "0.0.0",
    "Hardware Revision": "46858",
    "Manufacturing Date": "10/17/2018 4:04:25 PM",
    "Test Result": "Fail",
    "Test Start Time": "10/22/2018 6:14:14 AM",
    "Test End Time": "10/22/2018 6:16:11 AM"
  }
}
Code so far:
#define schema
schema = StructType(
[
StructField('Module',ArrayType(StructType(Seq
StructField('PCBA Serial Number',StringType,True),
StructField('Manufacturing Designator',StringType,True),
StructField('Firmware Version',StringType,True),
StructField('Hardware Revision',StringType,True),
StructField('Test Result',StringType,True),
StructField('Test Start Time',StringType,True),
StructField('Test End Time',StringType,True))), True) ,True),
StructField('Test Results',StringType(),True),
StructField('HVM Code Errors',StringType(),True)
]
#use from_json to explode json by applying it to column
df.withColumn("ActivityName", from_json("ActivityName", schema))\
.select(col('ActivityName'))\
.show()
Error:
SyntaxError: invalid syntax
File "<command-1632344621139040>", line 10
StructField('PCBA Serial Number',StringType,True),
^
SyntaxError: invalid syntax
Since you are using PySpark, the types have to be instantiated, i.e. StringType() instead of StringType. Also remove Seq and use a Python list ([]) in its place:
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Note: the example JSON shows Module as a single object, not an array;
# if that holds for the real data, use the inner StructType without ArrayType.
schema = StructType([
    StructField('Module', ArrayType(StructType([
        StructField('PCBA Serial Number', StringType(), True),
        StructField('Manufacturing Designator', StringType(), True),
        StructField('Firmware Version', StringType(), True),
        StructField('Hardware Revision', StringType(), True),
        StructField('Test Result', StringType(), True),
        StructField('Test Start Time', StringType(), True),
        StructField('Test End Time', StringType(), True)])), True),
    StructField('Test Results', StringType(), True),
    StructField('HVM Code Errors', StringType(), True)])

How do you print multiple key values from sub keys in a .json file?

I'm pulling a list of AMI IDs from my AWS account, and it is being written to a JSON file.
The JSON looks basically like this:
{
"Images": [
{
"CreationDate": "2017-11-24T11:05:32.000Z",
"ImageId": "ami-XXXXXXXX"
},
{
"CreationDate": "2017-11-24T11:05:32.000Z",
"ImageId": "ami-aaaaaaaa"
},
{
"CreationDate": "2017-10-24T11:05:32.000Z",
"ImageId": "ami-bbbbbbb"
},
{
"CreationDate": "2017-10-24T11:05:32.000Z",
"ImageId": "ami-cccccccc"
},
{
"CreationDate": "2017-12-24T11:05:32.000Z",
"ImageId": "ami-ddddddd"
},
{
"CreationDate": "2017-12-24T11:05:32.000Z",
"ImageId": "ami-eeeeeeee"
}
]
}
My code looks like this so far after gathering the info and writing it to a .json file locally:
#writes json output to file...
print('writing to response.json...')
with open('response.json', 'w') as outfile:
    json.dump(response, outfile, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ': '))
#Searches file...
print('opening response.json...')
with open("response.json") as f:
    file_parsed = json.load(f)
The next part I'm stuck on is how to iterate through the file and print only the CreationDate and ImageId values.
print('printing CreationDate and ImageId...')
for ami in file_parsed['Images']:
    #print ami['CreationDate'] #THIS WORKS
    #print ami['ImageId'] #THIS WORKS
    #print ami['CreationDate']['ImageId']
The last line there gives me this no matter how I have tried it: TypeError: string indices must be integers
My desired output is something like this:
2017-11-24T11:05:32.000Z ami-XXXXXXXX
Ultimately, what I'm looking to do is iterate through the entries that are a certain date or older and deregister those AMIs. So would I be converting these to a list or a dict?
Pretty much not a programmer here, so don't drown me.
TIA
You have already parsed the JSON; for the desired output you only need to concatenate the 'CreationDate' and 'ImageId' values like this:
for ami in file_parsed['Images']:
    print(ami['CreationDate'] + " " + ami['ImageId'])
ami['CreationDate'] evaluates to a string, and a string can only be indexed with integers, which is why ami['CreationDate']['ImageId'] raises a TypeError. Your other two commented lines, however, were correct.
To check if the date is older, you can make use of the datetime module. For instance, you can take the CreationDate (which is a string), convert it to a datetime object, create your own based on what that certain date is, and compare the two.
Something to this effect:
from datetime import datetime

def checkIfOlder(isoformat, targetDate):
    # Parse the timestamp string into a datetime object before comparing
    creationDate = datetime.strptime(isoformat, '%Y-%m-%dT%H:%M:%S.%fZ')
    return creationDate <= targetDate

certainDate = datetime(2017, 11, 30) # Or whichever date you want
So in your for loop:
for ami in file_parsed['Images']:
    creationDate = ami['CreationDate']
    if checkIfOlder(creationDate, certainDate):
        pass # write code to deregister AMIs here
Useful resources here are Python's datetime documentation and, in particular, the strftime/strptime format directives. HTH!
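On the list-or-dict question: json.load already returns a dict whose 'Images' value is a list of dicts, so no conversion is needed before looping. Putting the pieces together, a self-contained sketch might look like the following. The inline file_parsed data is a hypothetical stand-in for the parsed response.json, and parse_creation_date and cutoff are names invented for the example:

```python
from datetime import datetime

# Hypothetical stand-in for the parsed response.json from the question.
file_parsed = {
    "Images": [
        {"CreationDate": "2017-11-24T11:05:32.000Z", "ImageId": "ami-XXXXXXXX"},
        {"CreationDate": "2017-10-24T11:05:32.000Z", "ImageId": "ami-bbbbbbb"},
        {"CreationDate": "2017-12-24T11:05:32.000Z", "ImageId": "ami-ddddddd"},
    ]
}

cutoff = datetime(2017, 11, 30)  # example cutoff date

def parse_creation_date(value):
    # Convert the timestamp string into a datetime object.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")

# Print "CreationDate ImageId" for every image.
for ami in file_parsed["Images"]:
    print(ami["CreationDate"], ami["ImageId"])

# Collect the IDs of images created on or before the cutoff.
old_ami_ids = [
    ami["ImageId"]
    for ami in file_parsed["Images"]
    if parse_creation_date(ami["CreationDate"]) <= cutoff
]
print(old_ami_ids)
```

The resulting list of IDs is what a deregistration step would then loop over; that step is left out here because it depends on how the AWS client is set up.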

Reading a json file into a RDD (not dataFrame) using pyspark

I have the following file, test.json:
{
"id": 1,
"name": "A green door",
"price": 12.50,
"tags": ["home", "green"]
}
I want to load this file into a RDD. This is what I tried:
rddj = sc.textFile('test.json')
rdd_res = rddj.map(lambda x: json.loads(x))
I got an error:
Expecting object: line 1 column 1 (char 0)
I don't completely understand what json.loads does.
How can I resolve this problem?
textFile reads data line by line. Individual lines of your input are not syntactically valid JSON.
Just use the JSON reader:
spark.read.json("test.json", multiLine=True)
or (not recommended) read whole text files:
sc.wholeTextFiles("test.json").values().map(json.loads)
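The line-by-line failure can be reproduced without Spark. The sketch below holds the contents of test.json in a string (an assumption made so the example is self-contained) and compares parsing each line on its own with parsing the whole text at once:

```python
import json

# Contents of test.json from the question, held in a string for the demo.
content = '''{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}'''

# Parsing each line separately, which is what textFile + map(json.loads) amounts to.
failed_lines = 0
for line in content.splitlines():
    try:
        json.loads(line)
    except json.JSONDecodeError:
        failed_lines += 1

# Parsing the whole text at once, which is what reading the full file makes possible.
record = json.loads(content)
print(failed_lines, record["name"])
```

None of the individual lines is a complete JSON document, so every per-line call fails, while the full text parses into a single dict.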

How to create a list from json key:values in python3

I'm looking to create a Python 3 list of the locations from the JSON file city.list.json downloaded from OpenWeatherMaps http://bulk.openweathermap.org/sample/city.list.json.gz. The file passes http://json-validator.com/, but I cannot figure out how to correctly open the file and create a list of the values of the key 'name'. I keep hitting json.loads errors about io.TextIOWrapper, etc.
I created a short test file
[
{
"id": 707860,
"name": "Hurzuf",
"country": "UA",
"coord": {
"lon": 34.283333,
"lat": 44.549999
}
}
,
{
"id": 519188,
"name": "Novinki",
"country": "RU",
"coord": {
"lon": 37.666668,
"lat": 55.683334
}
}
]
Is there a way to parse this and create a list ["Hurzuf", "Novinki"] ?
You should use json.load() instead of json.loads(). I named my test file file.json and here is the code:
import json

with open('file.json', mode='r') as f:
    # First, read the JSON file and store its content in a Python variable
    # by using the json.load() function
    json_data = json.load(f)

# json_data is now a list of dictionaries
# (json.load() turns a JSON array into a list and each JSON object into a dict)

# Then we create a result list, in which we will store our names
result_list = []
# We iterate over each dictionary in our list
for json_dict in json_data:
    # We append each name value to our result list
    result_list.append(json_dict['name'])
print(result_list) # ['Hurzuf', 'Novinki']

# Shorter solution using a list comprehension
result_list = [json_dict['name'] for json_dict in json_data]
print(result_list) # ['Hurzuf', 'Novinki']
You simply iterate over the elements of the list and read the value stored under the key 'name' in each dictionary.
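The io.TextIOWrapper errors mentioned in the question are consistent with passing an open file object to json.loads(), which expects the JSON text itself, whereas json.load() reads from a file-like object. A small sketch of the difference, using io.StringIO as a stand-in for the opened file and a trimmed copy of the test data:

```python
import io
import json

text = '[{"id": 707860, "name": "Hurzuf"}, {"id": 519188, "name": "Novinki"}]'

# json.loads() takes the JSON text itself.
from_string = json.loads(text)

# json.load() takes a file-like object and reads the text from it.
from_file = json.load(io.StringIO(text))

# Passing the file-like object to json.loads() raises TypeError.
try:
    json.loads(io.StringIO(text))
    loads_rejected_file = False
except TypeError:
    loads_rejected_file = True

names = [city["name"] for city in from_file]
print(names, loads_rejected_file)
```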