Databricks - explode JSON from SQL column with PySpark

I'm new to Databricks. I have a SQL database table that I am creating a DataFrame from. One of the columns is a JSON string. I need to explode the nested JSON into multiple columns. I have used this post and this post to get to where I am now.
Example JSON:
{
  "Module": {
    "PCBA Serial Number": "G7456789",
    "Manufacturing Designator": "DISNEY",
    "Firmware Version": "0.0.0",
    "Hardware Revision": "46858",
    "Manufacturing Date": "10/17/2018 4:04:25 PM",
    "Test Result": "Fail",
    "Test Start Time": "10/22/2018 6:14:14 AM",
    "Test End Time": "10/22/2018 6:16:11 AM"
  }
}
Code so far:
#define schema
schema = StructType(
    [
        StructField('Module',ArrayType(StructType(Seq
            StructField('PCBA Serial Number',StringType,True),
            StructField('Manufacturing Designator',StringType,True),
            StructField('Firmware Version',StringType,True),
            StructField('Hardware Revision',StringType,True),
            StructField('Test Result',StringType,True),
            StructField('Test Start Time',StringType,True),
            StructField('Test End Time',StringType,True))), True) ,True),
        StructField('Test Results',StringType(),True),
        StructField('HVM Code Errors',StringType(),True)
    ]
#use from_json to explode json by applying it to column
df.withColumn("ActivityName", from_json("ActivityName", schema))\
    .select(col('ActivityName'))\
    .show()
Error:
SyntaxError: invalid syntax
File "<command-1632344621139040>", line 10
StructField('PCBA Serial Number',StringType,True),
^
SyntaxError: invalid syntax

Since you are using PySpark, the types should be instantiated: StringType() instead of StringType. Also, Seq is Scala syntax; remove it and replace it with a Python list [].
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField('Module', ArrayType(StructType([
        StructField('PCBA Serial Number', StringType(), True),
        StructField('Manufacturing Designator', StringType(), True),
        StructField('Firmware Version', StringType(), True),
        StructField('Hardware Revision', StringType(), True),
        StructField('Test Result', StringType(), True),
        StructField('Test Start Time', StringType(), True),
        StructField('Test End Time', StringType(), True)])), True),
    StructField('Test Results', StringType(), True),
    StructField('HVM Code Errors', StringType(), True)])
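To then flatten the parsed column, here is a minimal sketch, assuming the df and ActivityName column from the question. Note that in the sample JSON, Module is a single object rather than an array; if that is what your data actually holds, drop the ArrayType wrapper and the explode step. getField() is used because the field names contain spaces:

from pyspark.sql.functions import col, explode, from_json

# Parse the JSON string column with the corrected schema
parsed = df.withColumn("ActivityName", from_json(col("ActivityName"), schema))

# The schema declares Module as an array, so explode it into one row per
# element, then promote the nested fields to top-level columns.
flattened = (
    parsed
    .withColumn("Module", explode(col("ActivityName").getField("Module")))
    .select(
        col("Module").getField("PCBA Serial Number").alias("pcba_serial_number"),
        col("Module").getField("Test Result").alias("test_result"),
        col("ActivityName").getField("Test Results").alias("test_results"),
    )
)
flattened.show(truncate=False)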

Related

Analysing and formatting JSON using PostgreSQL

I have a table called api_details where I dump the below JSON value into the JSON column raw_data.
Now I need to make a report from this JSON string, and the expected output is something like below:
action_name                sent_timestamp                           Sent    Delivered
campaign_2475              1600416865.928737 - 1601788183.440805   7504    7483
campaign_d_1084_SUN15_ex   1604220248.153903 - 1604222469.087918   63095   62961
Below is the sample JSON output:
{
  "header": [
    "#0 action_name",
    "#1 sent_timestamp",
    "#0 Sent",
    "#1 Delivered"
  ],
  "name": "campaign - lifetime",
  "rows": [
    [
      "campaign_2475",
      "1600416865.928737 - 1601788183.440805",
      7504,
      7483
    ],
    [
      "campaign_d_1084_SUN15_ex",
      "1604220248.153903 - 1604222469.087918",
      63095,
      62961
    ],
    [
      "campaign_SUN15",
      "1604222469.148829 - 1604411016.029794",
      63303,
      63211
    ]
  ],
  "success": true
}
I tried like below, but it is not getting the results. I can do it in Python by looping through all the elements in the rows list, but is there an easy solution in PostgreSQL (version 11)?
SELECT raw_data->'rows'->0
FROM api_details
You can use the JSONB_ARRAY_ELEMENTS() function, such as:
SELECT (j.value)->>0 AS action_name,
(j.value)->>1 AS sent_timestamp,
(j.value)->>2 AS Sent,
(j.value)->>3 AS Delivered
FROM api_details
CROSS JOIN JSONB_ARRAY_ELEMENTS(raw_data->'rows') AS j
P.S. In this case the data type of raw_data is assumed to be JSONB; otherwise the argument within the function, raw_data->'rows', should be replaced with raw_data::JSONB->'rows' in order to perform explicit type casting.
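One caveat: ->> returns text, so Sent and Delivered come back as strings. As a hedged sketch of consuming this from Python with psycopg2, casting the counts to integers in SQL (the connection string is hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection details
with conn.cursor() as cur:
    # Same query as above, with ::int casts on the numeric columns
    cur.execute("""
        SELECT (j.value)->>0        AS action_name,
               (j.value)->>1        AS sent_timestamp,
               ((j.value)->>2)::int AS sent,
               ((j.value)->>3)::int AS delivered
        FROM api_details
        CROSS JOIN JSONB_ARRAY_ELEMENTS(raw_data->'rows') AS j
    """)
    for action_name, sent_timestamp, sent, delivered in cur.fetchall():
        print(action_name, sent_timestamp, sent, delivered)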

how to generate schema from a newline delimited JSON file in python

I want to generate a schema from a newline-delimited JSON file, where each row in the file has a variable set of key/value pairs. File size can vary from 5 MB to 25 MB.
Sample Data:
{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}
Expected schema:
[
  {"name": "col1", "type": "INTEGER"},
  {"name": "col2", "type": "STRING"},
  {"name": "col3", "type": "FLOAT"},
  {"name": "col4", "type": "DATE"}
]
Notes:
There is no scope to use any external tool, as files are loaded into an inbound location dynamically. The code will be used to trigger an event as soon as a file arrives and to perform a schema comparison.
Your first problem is that JSON does not have a date type, so you will get str there.
What I would do, if I were you, is this:
import json

# Wherever your input comes from
inp = """{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}"""

schema = {}

# Split it at newlines
for line in inp.split('\n'):
    # each line contains a "dict"
    tmp = json.loads(line)
    for key in tmp:
        # if we have not seen the key before, add it
        if key not in schema:
            schema[key] = type(tmp[key])
        # otherwise check the type
        else:
            if schema[key] != type(tmp[key]):
                raise Exception("Schema mismatch")

# format however you like
out = []
for item in schema:
    out.append({"name": item, "type": schema[item].__name__})
print(json.dumps(out, indent=2))
I'm using python types for simplicity, but you can write your own function to get the type, e.g. if you want to check if a string is actually a date.
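As a sketch of that last point, a small helper (hypothetical, assuming date strings are ISO-formatted like the "2020-09-08" in the sample) could map each parsed value straight to the type names in the expected schema:

from datetime import datetime

def infer_type(value):
    # Map a parsed JSON value to a schema type name. bool is checked
    # before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        # Treat strings that parse as ISO dates (e.g. "2020-09-08") as DATE
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "DATE"
        except ValueError:
            pass
    return "STRING"

Storing schema[key] = infer_type(tmp[key]) in the loop above would then produce the expected INTEGER/STRING/FLOAT/DATE names directly.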

Cleaning of JSON Objects using Spark

I have been trying to clean my JSON file. I used an RDD to read the JSON file and then tried to clean it using the replace function, but I am still not getting a correct JSON file because of the escape sequences present in the JSON values.
Here is the code with which I am trying to clean the JSON file of various disturbances.
The cleaned JSON shows errors. Please review and tell me the issue.
val readjson = sparkSession
  .sparkContext.textFile("dev.json")
val json = readjson.map(element => element
    .replace("\"\":\"\"", "\":\"")
    .replace("\"\",\"\"", "\",\"")
    .replace("\"\":", "\":")
    .replace(",\"\"", ",\"")
    .replace("\"{\"\"", "{\"")
    .replace("\"\"}\"", "\"}"))
  .saveAsTextFile("JSON")
Here is my JSON file:
"{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":"""",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":"""",""GENDR_CD"":"""",""PROV_NPI"":"""",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":"""",""PROFPROFL_DESGTN"":"""",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"
I tried cleaning the above JSON and got the following result:
{
  "SEQ_NO": 596514,
  "PROV_DEMOG_SK": 596514,
  "PROV_ID": "QMP000003370581",
  "FRST_NM": "",
  "LAST_NM": "RICHARD WHITTINGTON BUTCHER",
  "FUL_NM": "",
  "GENDR_CD": "",
  "PROV_NPI": "",
  "PROV_STAT": "Incomplete",
  "PROV_TY": "03",
  "DT_OF_BRTH": "",
  "PROFPROFL_DESGTN": "",
  "ETL_LAST_UPDT_DT_TM": "2020-04-28 11:43:31.000000",
  "PROV_CLSFTN_CD": "A",
  "SRC_DATA_KEY": 50,
  "OPRN_CD": "I",
  "REC_SET": "F"
}
Online JSON validators show that it is incorrect.
Looks like your JSON has one or a few control characters (\u0009, the tab character); try replacing them with
.replaceAll("\\u0009", " ")
You can do it in the below sequence:
val replacedVal = """{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":\"\"\"",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":\"\"\"",""GENDR_CD"":\"\"\"",""PROV_NPI"":\"\"\"",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":\"\"\"",""PROFPROFL_DESGTN"":\"\"\"",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"""
  .replace("""\"""",""""""")
  .replace("""""""",""""""")
  .replaceAll("\\u0009"," ")
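For what it's worth, the input in the question looks like CSV-style escaping: each JSON object is wrapped in outer quotes and every inner quote is doubled. A minimal PySpark sketch of undoing exactly that (an assumption based on the sample line, with one object per line):

def unescape_csv_json(line):
    # "{""SEQ_NO"":596514,...}"  ->  {"SEQ_NO":596514,...}
    line = line.strip()
    # Drop the wrapping quotes, if present
    if line.startswith('"') and line.endswith('"'):
        line = line[1:-1]
    # Collapse doubled quotes back to single ones
    return line.replace('""', '"')

sc.textFile("dev.json").map(unescape_csv_json).saveAsTextFile("JSON_cleaned")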

Fetch all values irrespective of keys from a column of JSON type in a Spark dataframe using Spark with scala

I was trying to load a TSV file containing some metadata on movies using Spark. This TSV file contains genre info on the movies in JSON format [the last column in every row].
Sample File
975900 /m/03vyhn Ghosts of Mars 2001-08-24 14010832 98.0 {"/m/02h40lc": "English Language"} {"/m/09c7w0": "United States of America"} {"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}
3196793 /m/08yl5d Getting Away with Murder: The JonBenét Ramsey Mystery 2000-02-16 95.0 {"/m/02h40lc": "English Language"} {"/m/09c7w0": "United States of America"} {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biographical film", "/m/07s9rl0": "Drama", "/m/0hj3n01": "Crime Drama"}
I tried the below code, which enables me to access a particular value from the genre JSON:
val ss = SessionCreator.createSession("DataCleaning", "local[*]") // helper function that creates a spark session and returns it
val headerInfoRb = ResourceBundle.getBundle("conf.headerInfo")
val movieDF = DataReader.readFromTsv(ss, "D:/Utility/Datasets/MovieSummaries/movie.metadata.tsv")
  .toDF(headerInfoRb.getString("metadataReader").split(',').toSeq: _*) // DataReader.readFromTsv is a helper function that reads a TSV file; it takes the spark session and a file path and returns a dataframe via the session's read function
movieDF.select("wiki_mv_id", "mv_nm", "mv_genre")
  .withColumn("genre_frmttd", get_json_object(col("mv_genre"), "$./m/02kdv5l"))
  .show(1, false)
Output
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|wiki_mv_id|mv_nm |mv_genre |genre_frmttd|
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|975900 |Ghosts of Mars|{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}|Action |
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
only showing top 1 row
I want the genre_frmttd column to look like the below for every row in the DataFrame [the snippet below is for the first sample row]:
[Thriller,Fiction,Horror,Adventure,Supernatural,Action,Space Western]
I am a bit of a rookie in Scala and Spark; please suggest a way to list out the values.
- parse the JSON using from_json
- cast it to MapType(StringType, StringType)
- extract only the values using map_values
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

movieDF.select("wiki_mv_id", "mv_nm", "mv_genre")
  .withColumn("genre_frmttd", map_values(from_json(col("mv_genre"), MapType(StringType, StringType))))
  .show(1, false)
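Since several of the related questions here use PySpark, this is a sketch of the same approach in Python, assuming the same movieDF (map_values and from_json are available in pyspark.sql.functions from Spark 2.3 on):

from pyspark.sql.functions import col, from_json, map_values
from pyspark.sql.types import MapType, StringType

movieDF.select("wiki_mv_id", "mv_nm", "mv_genre") \
    .withColumn(
        "genre_frmttd",
        map_values(from_json(col("mv_genre"), MapType(StringType(), StringType())))
    ) \
    .show(1, truncate=False)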

Reading a json file into a RDD (not dataFrame) using pyspark

I have the following file, test.json:
{
  "id": 1,
  "name": "A green door",
  "price": 12.50,
  "tags": ["home", "green"]
}
I want to load this file into a RDD. This is what I tried:
rddj = sc.textFile('test.json')
rdd_res = rddj.map(lambda x: json.loads(x))
I got an error:
Expecting object: line 1 column 1 (char 0)
I don't completely understand what json.loads does. How can I resolve this problem?
textFile reads data line by line. Individual lines of your input are not syntactically valid JSON.
Just use the JSON reader:
spark.read.json("test.json", multiLine=True)
or (not recommended) wholeTextFiles:
sc.wholeTextFiles("test.json").values().map(json.loads)
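If you specifically want an RDD of Python dicts, either route gets you there; this is a minimal sketch (note that .rdd yields Row objects rather than dicts, hence the asDict() call):

import json

# Route 1: read the whole file as one string and parse it yourself
rdd1 = sc.wholeTextFiles("test.json").values().map(json.loads)

# Route 2: let Spark parse it, then drop down to an RDD of dicts
rdd2 = spark.read.json("test.json", multiLine=True).rdd.map(lambda row: row.asDict())

rdd1.first()["tags"]  # ['home', 'green']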