I have a JSON file that I am reading in Spark.
The schema prints correctly, but when I try to read the Info column or any of its sub-elements, it is always NULL (even though it is not NULL in the file).
// reading the file
val df = spark.read.json("FilePath")
df.printSchema()
root
|-- data_is: boolean (nullable = true)
|-- Student: struct (nullable = true)
| |-- Id: string (nullable = true)
| |-- JoinDate: string (nullable = true)
| |-- LeaveDate: string (nullable = true)
|-- Info: struct (nullable = true)
| |-- details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Father_salary: double (nullable = true)
| | | |-- Mother_salary: double (nullable = true)
| | | |-- Address: string (nullable = true)
| |-- studentInfo: struct (nullable = true)
| | |-- Age: double (nullable = true)
| | |-- Name: string (nullable = true)
df.select("Student").show()
shows the field values in the Student element,
and even when I select Student.Id I can get the Id.
But whenever I parse Info, I always get a NULL value, which is not NULL in the file.
df.select("Info").show() // is showing as NULL
df.select("Info.detail").show() // is showing as NULL
even Info.Summary is NULL.
Can anybody suggest how to get the actual field value instead of NULL?
JSON File
{"Student":{"JoinDate":"20200909","LeaveDate":"20200909","id":"XA12"},"Info":{"studentInfo":{"Age":13,"Name":"Alex"},"details":[{"Father_salary":1234.00,"Mother_salary":0,"Address":""}]},"data_is":true}
Related
I am trying to get the value of "__delta" from the following JSON schema that has been loaded into a dataframe. How do I do that in PySpark?
root
|-- d: struct (nullable = true)
| |-- __delta: string (nullable = true)
| |-- __next: string (nullable = true)
| |-- results: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- ABRVW: string (nullable = true)
| | | |-- ADRNR: string (nullable = true)
| | | |-- ANRED: string (nullable = true)
With a struct-typed JSON column, just select the nested attribute you want with dot notation:
df.select("d.__delta")
In Scala, the equivalent is df.select($"d.__delta").
I have a dataframe with the following schema, where 'name' is a string type but its value is complex JSON with arrays and structs.
With the string datatype I am not able to parse the data and write it out as rows,
so I am trying to convert the datatype and apply explode to parse the data.
Current:
root
|-- id: string (nullable = true)
|-- partitionNo: string (nullable = true)
|-- name: string (nullable = true)
Expected (after conversion):
root
|-- id: string (nullable = true)
|-- partitionNo: string (nullable = true)
|-- name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- extension: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- url: string (nullable = true)
| | | | |-- valueMetadata: struct (nullable = true)
| | | | | |-- modifiedDateTime: string (nullable = true)
| | | | | |-- code: string (nullable = true)
| | |-- lastName: string (nullable = true)
| | |-- firstName: array (nullable = true)
| | | |-- element: string (containsNull = true)
You can use from_json, but you need to provide a schema, which can be inferred automatically with a bit of spaghetti code, because from_json only accepts a schema in the form of a lit column:
val df2 = df.withColumn(
  "name",
  from_json(
    $"name",
    // the lines below generate the schema from the first row's JSON
    lit(
      df.select(
        schema_of_json(
          lit(df.select($"name").head()(0))
        )
      ).head()(0)
    )
    // end of schema generation
  )
)
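Once name is a proper array column, it can be flattened with explode, e.g. (a sketch based on the expected schema above; the field names are taken from it, and explode_outer keeps rows whose extension array is missing):

import org.apache.spark.sql.functions.{explode, explode_outer}

// First explode the outer name array, then the nested extension array.
val flat = df2
  .select($"id", $"partitionNo", explode($"name").as("n"))
  .select(
    $"id",
    $"partitionNo",
    $"n.lastName".as("lastName"),
    explode_outer($"n.extension").as("ext")
  )
  .select($"id", $"partitionNo", $"lastName", $"ext.url", $"ext.valueMetadata.code")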
I have a parquet file; one of its columns is a struct field which stores JSON. The struct is shown below.
originator: struct (nullable = true)
|-- originatorDetail: struct (nullable = true)
| |-- applicationDeployedId: string (nullable = true)
| |-- applicationDeployedNameVersion: string (nullable = true)
| |-- applicationNameVersion: string (nullable = true)
| |-- cloudHost: string (nullable = true)
| |-- cloudRegion: string (nullable = true)
| |-- cloudStack: string (nullable = true)
| |-- version: string (nullable = true)
|-- Orversion: string (nullable = true)
Only the version field is required in the JSON; the others are optional fields,
so some records might have only 2 elements and still be valid.
Suppose I want to read the cloudHost field. I can read it as originator.originatorDetail.cloudHost, but for records where this optional field is not present, that would fail because the element is not there. Is there any way I can read these optional values as null for the records where they are not present, without using a UDF?
Some examples:
"originator": {
  "originatorDetail": {
    "applicationDeployedId": "PSLV",
    "cloudRegion": "Mangal",
    "cloudHost": "Petrol",
    "applicationNameVersion": "CRDI",
    "applicationDeployedNameVersion": "Tuna",
    "cloudStack": "DEV",
    "version": "1.1.0"
  },
  "Orversion": "version.1"
}
-------------
"originator": {
  "originatorDetail": {
    "version": "1.1.0"
  },
  "Orversion": "version.1"
}
Required Output
+---------------------+------------------------------+----------------------+---------+-----------+----------+-------+---------+
|applicationDeployedId|applicationDeployedNameVersion|applicationNameVersion|cloudHost|cloudRegion|cloudStack|version|Orversion|
+---------------------+------------------------------+----------------------+---------+-----------+----------+-------+---------+
|PSLV                 |Tuna                          |CRDI                  |Petrol   |Mangal     |DEV       |1.1.0  |version.1|
|                     |                              |                      |         |           |          |1.1.0  |version.1|
+---------------------+------------------------------+----------------------+---------+-----------+----------+-------+---------+
Use the from_json function, available from Spark 2.4+.
Read the parquet data, then use from_json, passing a schema that matches your JSON column.
Spark will read the matching data and add the non-matching fields with null values.
Example:
df.show(10,False)
#+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|id |json_data                                                                                                                                                                                                                                                      |
#+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|1 |{"originator": {"originatorDetail": {"applicationDeployedId": "PSLV","cloudRegion": "Mangal","cloudHost": "Petrol","applicationNameVersion": "CRDI","applicationDeployedNameVersion": "Tuna","cloudStack": "DEV","version": "1.1.0"},"Orversion": "version.1"}}|
#|2 |{"originator": { "originatorDetail": { "version": "1.1.0" }, "Orversion": "version.1"}} |
#+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
schema=StructType([StructField("originator",StructType([StructField("Orversion",StringType(),True),
StructField("originatorDetail",StructType([StructField("applicationDeployedId",StringType(),True),
StructField("applicationDeployedNameVersion",StringType(),True),
StructField("applicationNameVersion",StringType(),True),
StructField("cloudHost",StringType(),True),
StructField("cloudRegion",StringType(),True),
StructField("cloudStack",StringType(),True),
StructField("version",StringType(),True)]),True)]),True)])
from pyspark.sql.functions import *
from pyspark.sql.types import *
#then read the json_data column using from_json function
df.withColumn("json_converted",from_json(col("json_data"),schema)).select("id","json_converted").show(10,False)
#+---+-------------------------------------------------------------+
#|id |json_converted                                               |
#+---+-------------------------------------------------------------+
#|1  |[[version.1, [PSLV, Tuna, CRDI, Petrol, Mangal, DEV, 1.1.0]]]|
#|2  |[[version.1, [,,,,,, 1.1.0]]]                                |
#+---+-------------------------------------------------------------+
df.withColumn("json_converted",from_json(col("json_data"),schema)).select("id","json_converted").printSchema()
#root
# |-- id: long (nullable = true)
# |-- json_converted: struct (nullable = true)
# | |-- originator: struct (nullable = true)
# | | |-- Orversion: string (nullable = true)
# | | |-- originatorDetail: struct (nullable = true)
# | | | |-- applicationDeployedId: string (nullable = true)
# | | | |-- applicationDeployedNameVersion: string (nullable = true)
# | | | |-- applicationNameVersion: string (nullable = true)
# | | | |-- cloudHost: string (nullable = true)
# | | | |-- cloudRegion: string (nullable = true)
# | | | |-- cloudStack: string (nullable = true)
# | | | |-- version: string (nullable = true)
# even though id=2 doesn't have all the fields, they are still added (with null values)
df.withColumn("json_converted",from_json(col("json_data"),schema)).select("json_converted.originator.originatorDetail.applicationDeployedId").show(10,False)
#+---------------------+
#|applicationDeployedId|
#+---------------------+
#|PSLV |
#|null |
#+---------------------+
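As a side note, the same schema can be written more compactly as a DDL string and turned back into a StructType. A Scala sketch of the same idea (an untested illustration; it assumes a dataframe df with the same json_data column as above):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

// DDL-string equivalent of the StructType built field by field above.
val ddl =
  """originator struct<
    |  Orversion: string,
    |  originatorDetail: struct<
    |    applicationDeployedId: string,
    |    applicationDeployedNameVersion: string,
    |    applicationNameVersion: string,
    |    cloudHost: string,
    |    cloudRegion: string,
    |    cloudStack: string,
    |    version: string
    |  >
    |>""".stripMargin

val parsed = df.withColumn("json_converted", from_json(col("json_data"), StructType.fromDDL(ddl)))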
I'm using Spark structured streaming to read a Kafka topic and want to convert the following complex JSON (kafka-msgs) into a dataframe having NAME, ADDRESS, DESCRIPTION, CODE, DEPARTMENT, INFA_OP_TYPE, and DTL__CAPXTIMESTAMP columns.
{
"meta_data": [{"name":{"string":"INFA_SEQUENCE"},"value":
{"string":"2,PWX_GENERIC"},"type":null},
{"name":{"string":"INFA_TABLE_NAME"},"value":{"string":"customers"},"type":null},
{"name":{"string":"INFA_OP_TYPE"},"value":{"string":"INSERT_EVENT"},"type":null},
{"name":{"string":"DTL__CAPXRESTART1"},"value":{"string":"B+IABwAfA"},"type":null},
{"name":{"string":"DTL__CAPXRESTART2"},"value":{"string":"AAABpMwgRDk="},"type":null},
{"name":{"string":"DTL__CAPXUOW"},"value":{"string":"AAMKPgAAqaIABg=="},"type":null},
{"name":{"string":"DTL__CAPXUSER"},"value":null,"type":null},
{"name":{"string":"DTL__CAPXTIMESTAMP"},"value":{"string":"201807310934257270000000"},"type":null},
{"name":{"string":"DTL__CAPXACTION"},"value":{"string":"I"},"type":null}],
"columns":{"array":[{"name":{"string":"NAME"},"value":{"string":"ABCD"},"isPresent":{"boolean":true}},
{"name":{"string":"ADDRESS"},"value":{"string":"123,Bark street"},"isPresent":{"boolean":true}},
{"name":{"string":"DESCRIPTION"},"value":{"string":"Canadian"},"isPresent":{"boolean":true}},
{"name":{"string":"CODE"},"value":{"string":"3_1"},"isPresent":{"boolean":true}},
{"name":{"string":"DEPARTMENT"},"value":{"string":"HR"},"isPresent":{"boolean":true}}
] }
}
I'm able to extract the two JSON objects "meta_data" and "columns", but I'm unable to explode "columns.array":
newJsonObj = events.select(get_json_object(events.value,'$.meta_data').alias('meta_data'),get_json_object(events.value,'$.columns.array').alias('columns'))
And I don't know how to extract the values from the two JSON objects and create a dataframe having columns from both.
-- Schema of events dataframe --
root
|-- columns: struct (nullable = true)
| |-- array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- isPresent: struct (nullable = true)
| | | | |-- boolean: boolean (nullable = true)
| | | |-- name: struct (nullable = true)
| | | | |-- string: string (nullable = true)
| | | |-- value: struct (nullable = true)
| | | | |-- string: string (nullable = true)
|-- meta_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: struct (nullable = true)
| | | |-- string: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- value: struct (nullable = true)
| | | |-- string: string (nullable = true)
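No answer was posted for this one, but one possible shape, given the events schema above, is to explode each array into (name, value) pairs and pivot the names into columns. A hedged, untested Scala sketch (the question uses PySpark; also, pivot and monotonically_increasing_id are not supported on streaming dataframes, so this only applies to batch reads or inside foreachBatch):

import org.apache.spark.sql.functions._

// Tag each Kafka message so the exploded name/value pairs can be regrouped.
val withId = events.withColumn("msg_id", monotonically_increasing_id())

// columns.array -> one row per (message, column), then pivot names into columns.
val cols = withId
  .select($"msg_id", explode($"columns.array").as("c"))
  .select($"msg_id", $"c.name.string".as("name"), $"c.value.string".as("value"))
  .groupBy($"msg_id")
  .pivot("name", Seq("NAME", "ADDRESS", "DESCRIPTION", "CODE", "DEPARTMENT"))
  .agg(first($"value"))

// meta_data -> same treatment for the two fields of interest.
val meta = withId
  .select($"msg_id", explode($"meta_data").as("m"))
  .select($"msg_id", $"m.name.string".as("name"), $"m.value.string".as("value"))
  .groupBy($"msg_id")
  .pivot("name", Seq("INFA_OP_TYPE", "DTL__CAPXTIMESTAMP"))
  .agg(first($"value"))

val result = cols.join(meta, Seq("msg_id")).drop("msg_id")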
I was trying to get the data from the JSON that I got from the wiki API:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Rajanna&rvsection=0
I was able to print its schema exactly:
scala> data.printSchema
root
|-- batchcomplete: string (nullable = true)
|-- query: struct (nullable = true)
| |-- pages: struct (nullable = true)
| | |-- 28597189: struct (nullable = true)
| | | |-- ns: long (nullable = true)
| | | |-- pageid: long (nullable = true)
| | | |-- revisions: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- *: string (nullable = true)
| | | | | |-- contentformat: string (nullable = true)
| | | | | |-- contentmodel: string (nullable = true)
| | | |-- title: string (nullable = true)
I want to extract the data of the key "*" (|-- *: string (nullable = true)).
Please suggest a solution.
One problem is
pages: struct (nullable = true)
| | |-- 28597189: struct (nullable = true)
The number 28597189 is unique to every title.
First we need to parse the JSON to get the key (28597189) dynamically, then use it to extract the data from the Spark dataframe, like below:
val keyName = dataFrame.selectExpr("query.pages.*").schema.fieldNames(0)
println(s"Key Name : $keyName")
this will give you the key dynamically:
Key Name : 28597189
Then use this to extract the data
var revDf = dataFrame.select(explode(dataFrame(s"query.pages.$keyName.revisions")).as("revision")).select("revision.*")
revDf.printSchema()
Output:
root
|-- *: string (nullable = true)
|-- contentformat: string (nullable = true)
|-- contentmodel: string (nullable = true)
Then we rename the column * to a friendlier name like star_column:
revDf = revDf.withColumnRenamed("*", "star_column")
revDf.printSchema()
Output:
root
|-- star_column: string (nullable = true)
|-- contentformat: string (nullable = true)
|-- contentmodel: string (nullable = true)
Once we have our final dataframe, we call show:
revDf.show()
Output:
+--------------------+-------------+------------+
| star_column|contentformat|contentmodel|
+--------------------+-------------+------------+
|{{EngvarB|date=Se...| text/x-wiki| wikitext|
+--------------------+-------------+------------+
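One small caveat on the * column: if you ever need to reference it before the rename, wrap it in backticks so Spark treats it as a literal column name rather than the select-everything wildcard (my reading of Spark's identifier-quoting rules, not tested against this exact dataframe):

// Backticks force * to be parsed as an identifier, not a wildcard.
revDf.selectExpr("`*` as star_column").show()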