Apache spark: Write JSON DataFrame partitionBy nested columns

Apache spark: Write JSON DataFrame partitionBy nested columns - json

I have this kind of JSON data:
{
"data": [
{
"id": "4619623",
"team": "452144",
"created_on": "2018-10-09 02:55:51",
"links": {
"edit": "https://some_page",
"publish": "https://some_publish",
"default": "https://some_default"
}
},
{
"id": "4619600",
"team": "452144",
"created_on": "2018-10-09 02:42:25",
"links": {
"edit": "https://some_page",
"publish": "https://some_publish",
"default": "https://some_default"
}
}
}
I read this data using Apache spark and I want to write them partition by id column. When I use this:
df.write.partitionBy("data.id").json(<path_to_folder>)
I will get error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Partition column data.id not found in schema
I also tried to use explode function like that:
import org.apache.spark.sql.functions.{col, explode}
val renamedDf= df.withColumn("id", explode(col("data.id")))
renamedDf.write.partitionBy("id").json(<path_to_folder>)
That actually helped, but each id partition folder contained the same original JSON file.
EDIT: schema of df DataFrame:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
Schema of renamedDf DataFrame:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
|-- id: string (nullable = true)
I am using spark 2.1.0
I found this solution: DataFrame partitionBy on nested columns
And this example:http://bigdatums.net/2016/02/12/how-to-extract-nested-json-data-in-spark/
But none of this helped me to solve my problem.
Thanks in andvance for any help.

try the following code:
val renamedDf = df
.select(explode(col("data")) as "x" )
.select($"x.*")
renamedDf.write.partitionBy("id").json(<path_to_folder>)

You are just missing a select statement after the initial explode
val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("/FileStore/tables/test.json")
df.printSchema
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
| | |-- team: string (nullable = true)
import org.apache.spark.sql.functions.{col, explode}
val df1= df.withColumn("data", explode(col("data")))
df1.printSchema
root
|-- data: struct (nullable = true)
| |-- created_on: string (nullable = true)
| |-- id: string (nullable = true)
| |-- links: struct (nullable = true)
| | |-- default: string (nullable = true)
| | |-- edit: string (nullable = true)
| | |-- publish: string (nullable = true)
| |-- team: string (nullable = true)
val df2 = df1.select("data.created_on","data.id","data.team","data.links")
df2.show
+-------------------+-------+------+--------------------+
| created_on| id| team| links|
+-------------------+-------+------+--------------------+
|2018-10-09 02:55:51|4619623|452144|[https://some_def...|
|2018-10-09 02:42:25|4619600|452144|[https://some_def...|
+-------------------+-------+------+--------------------+
df2.write.partitionBy("id").json("FileStore/tables/test_part.json")
val f = spark.read.json("/FileStore/tables/test_part.json/id=4619600")
f.show
+-------------------+--------------------+------+
| created_on| links| team|
+-------------------+--------------------+------+
|2018-10-09 02:42:25|[https://some_def...|452144|
+-------------------+--------------------+------+
val full = spark.read.json("/FileStore/tables/test_part.json")
full.show
+-------------------+--------------------+------+-------+
| created_on| links| team| id|
+-------------------+--------------------+------+-------+
|2018-10-09 02:55:51|[https://some_def...|452144|4619623|
|2018-10-09 02:42:25|[https://some_def...|452144|4619600|
+-------------------+--------------------+------+-------+

Related

Define a schema from a DF column in array type

I have a metadata file with a column with information on the schema of a file:
[{"column_datatype": "varchar", "column_description": "Indicates whether the Customer belongs to a particular business size, business activity, retail segment, demography, or other group and is used for reporting on regio performance regio migration.", "column_length": "4", "column_name": "clnt_grp_cd", "column_personally_identifiable_information": "False", "column_precision": "4", "column_primary_key": "True", "column_scale": null, "column_security_classifications": [], "column_sequence_number": "1"}
root
|-- column_info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column_datatype: string (nullable = true)
| | |-- column_description: string (nullable = true)
| | |-- column_length: string (nullable = true)
| | |-- column_name: string (nullable = true)
| | |-- column_personally_identifiable_information: string (nullable = true)
| | |-- column_precision: string (nullable = true)
| | |-- column_primary_key: string (nullable = true)
| | |-- column_scale: string (nullable = true)
| | |-- column_security_classifications: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- column_sequence_number: string (nullable = true)
I want to read a df using this schema. Something like:
schema = StructType([ \
StructField("clnt_grp_cd",StringType(),True),\
StructField("clnt_grp_lvl1_nm",StringType(),True),\
(...)
])
df = spark.read.schema(schema).format("csv").option("header","true").load(filenamepath)
Is there a built in method to parse this as a schema?

Read Pyspark Struct Json Column Non Required elements

I have a parquet file , one of the column of it is a struct field which stores the json. The struct is shown as below.
originator: struct (nullable = true)
|-- originatorDetail: struct (nullable = true)
| |-- applicationDeployedId: string (nullable = true)
| |-- applicationDeployedNameVersion: string (nullable = true)
| |-- applicationNameVersion: string (nullable = true)
| |-- cloudHost: string (nullable = true)
| |-- cloudRegion: string (nullable = true)
| |-- cloudStack: string (nullable = true)
| |-- version: string (nullable = true)
|-- Orversion: string (nullable = true)
Only the version field is required in json and others are non required fields.
So some of the records might have only 2 element and still be valid.
Suppose i want to read the CloudHost field. I can read it as originator.originatorDetail.cloudHost. But for records where this non required field is not present. It would fail as the element is not there. Is there any way I can read these non required value as null for records where the values are not present without using a udf.
Some examples
originator": {
"originatorDetail": {
"applicationDeployedId": "PSLV",
"cloudRegion": "Mangal",
"cloudHost": "Petrol",
"applicationNameVersion": "CRDI",
"applicationDeployedNameVersion": "Tuna",
"cloudStack": "DEV",
"version": "1.1.0"
},
Orversion": "version.1"
}
-------------
originator": {
"originatorDetail": {
"version": "1.1.0"
},
Orversion": "version.1"
}
Required Output
applicationDeployedId applicationDeployedNameVersion applicationNameVersion cloudHost cloudRegion cloudStack version Orversion
PSLV Tuna CRDI Petrol Mangal DEV 1.1.0 version.1
1.1.0 version.1

Use from_json function from Spark-2.4+
Read the parquet data then use from_json by passing the schema that matches with your json column.
Spark will read the matching data and adds non matching fields with null values.
Example:
df.show(10,False)
#+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|id |json_data #|
#+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|1 |{"originator": {"originatorDetail": {"applicationDeployedId": "PSLV","cloudRegion": "Mangal","cloudHost": "Petrol","applicationNameVersion": "CRDI","applicationDeployedNameVersion": "Tuna","cloudStack": "DEV","version": "1.1.0"},"Orversion": "version.1"}}|
#|2 |{"originator": { "originatorDetail": { "version": "1.1.0" }, "Orversion": "version.1"}} |
#+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
schema=StructType([StructField("originator",StructType([StructField("Orversion",StringType(),True),
StructField("originatorDetail",StructType([StructField("applicationDeployedId",StringType(),True),
StructField("applicationDeployedNameVersion",StringType(),True),
StructField("applicationNameVersion",StringType(),True),
StructField("cloudHost",StringType(),True),
StructField("cloudRegion",StringType(),True),
StructField("cloudStack",StringType(),True),
StructField("version",StringType(),True)]),True)]),True)])
from pyspark.sql.functions import *
from pyspark.sql.types import *
#then read the json_data column using from_json function
df.withColumn("json_converted",from_json(col("json_data"),schema)).select("id","json_converted").show(10,False)
#+---+--------------------------------------------------------+
#|id |json_converted |
#+---+--------------------------------------------------------+
#|1 |[[version.1, [PSLV, Tuna,, Petrol, Mangal, DEV, 1.1.0]]]|
#|2 |[[version.1, [,,,,,, 1.1.0]]] |
#+---+--------------------------------------------------------+
df.withColumn("json_converted",from_json(col("json_data"),schema)).select("id","json_converted").printSchema()
#root
# |-- id: long (nullable = true)
# |-- json_converted: struct (nullable = true)
# | |-- originator: struct (nullable = true)
# | | |-- Orversion: string (nullable = true)
# | | |-- originatorDetail: struct (nullable = true)
# | | | |-- applicationDeployedId: string (nullable = true)
# | | | |-- applicationDeployedNameVersion: string (nullable = true)
# | | | |-- applicationNameVersi: string (nullable = true)
# | | | |-- cloudHost: string (nullable = true)
# | | | |-- cloudRegion: string (nullable = true)
# | | | |-- cloudStack: string (nullable = true)
# | | | |-- version: string (nullable = true)
#even though we don't have all fields from id=2 still we added fields
df.withColumn("json_converted",from_json(col("json_data"),schema)).select("json_converted.originator.originatorDetail.applicationDeployedId").show(10,False)
#+---------------------+
#|applicationDeployedId|
#+---------------------+
#|PSLV |
#|null |
#+---------------------+

what is optimal way to parse following kafka JSON message to pyspark dataframe?

I'm using spark structured streaming to read kafka topic and want to convert following complex JSON (kafka-msgs) in to dataframe having "NAME,ADDRESS,DESCRIPTION,CODE,DEPARTMENT,INFA_OP_TYPE,DTL__CAPXTIMESTAMP" columns.
{
"meta_data": [{"name":{"string":"INFA_SEQUENCE"},"value":
{"string":"2,PWX_GENERIC"},"type":null},
{"name":{"string":"INFA_TABLE_NAME"},"value":{"string":"customers"},"type":null},
{"name":{"string":"INFA_OP_TYPE"},"value":{"string":"INSERT_EVENT"},"type":null},
{"name":{"string":"DTL__CAPXRESTART1"},"value":{"string":"B+IABwAfA"},"type":null},
{"name":{"string":"DTL__CAPXRESTART2"},"value":{"string":"AAABpMwgRDk="},"type":null},
{"name":{"string":"DTL__CAPXUOW"},"value":{"string":"AAMKPgAAqaIABg=="},"type":null},
{"name":{"string":"DTL__CAPXUSER"},"value":null,"type":null},
{"name":{"string":"DTL__CAPXTIMESTAMP"},"value":{"string":"201807310934257270000000"},"type":null},
{"name":{"string":"DTL__CAPXACTION"},"value":{"string":"I"},"type":null}],
"columns":{"array":[{"name":{"string":"NAME"},"value":{"string":"ABCD"},"isPresent":{"boolean":true}},
{"name":{"string":"ADDRESS"},"value":{"string":"123,Bark street"},"isPresent":{"boolean":true}},
{"name":{"string":"DESCRIPTION"},"value":{"string":"Canadian"},"isPresent":{"boolean":true}},
{"name":{"string":"CODE"},"value":{"string":"3_1"},"isPresent":{"boolean":true}},
{"name":{"string":"DEPARTMENT"},"value":{"string":"HR"},"isPresent":{"boolean":true}}
] }
}
I'm able to extract two json object "meta_data" and "columns" but I'm unable to explode "columns.array"
newJsonObj = events.select(get_json_object(events.value,'$.meta_data').alias('meta_data'),get_json_object(events.value,'$.columns.array').alias('columns'))
And I don't know how to extract values from two json object and create dataframe having columns from both json object.
-- Schema of events dataframe --
root
|-- columns: struct (nullable = true)
| |-- array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- isPresent: struct (nullable = true)
| | | | |-- boolean: boolean (nullable = true)
| | | |-- name: struct (nullable = true)
| | | | |-- string: string (nullable = true)
| | | |-- value: struct (nullable = true)
| | | | |-- string: string (nullable = true)
|-- meta_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: struct (nullable = true)
| | | |-- string: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- value: struct (nullable = true)
| | | |-- string: string (nullable = true)

how to parse the wiki infobox json with scala spark

I was trying to get the data from json data which I got it from wiki api
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Rajanna&rvsection=0
I was able to print the schema of that exactly
scala> data.printSchema
root
|-- batchcomplete: string (nullable = true)
|-- query: struct (nullable = true)
| |-- pages: struct (nullable = true)
| | |-- 28597189: struct (nullable = true)
| | | |-- ns: long (nullable = true)
| | | |-- pageid: long (nullable = true)
| | | |-- revisions: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- *: string (nullable = true)
| | | | | |-- contentformat: string (nullable = true)
| | | | | |-- contentmodel: string (nullable = true)
| | | |-- title: string (nullable = true)
I want to extract the data of the key "*" |-- *: string (nullable = true)
Please suggest me a solution.
One problem is
pages: struct (nullable = true)
| | |-- 28597189: struct (nullable = true)
the number 28597189 is unique to every title.

First we need to parse the json to get the key (28597189) dynamically then use this to extract the data from spark dataframe like below
val keyName = dataFrame.selectExpr("query.pages.*").schema.fieldNames(0)
println(s"Key Name : $keyName")
this will give you the key dynamically:
Key Name : 28597189
Then use this to extract the data
var revDf = dataFrame.select(explode(dataFrame(s"query.pages.$keyName.revisions")).as("revision")).select("revision.*")
revDf.printSchema()
Output:
root
|-- *: string (nullable = true)
|-- contentformat: string (nullable = true)
|-- contentmodel: string (nullable = true)
and we will be renaming the column * with some key name like star_column
revDf = revDf.withColumnRenamed("*", "star_column")
revDf.printSchema()
Output:
root
|-- star_column: string (nullable = true)
|-- contentformat: string (nullable = true)
|-- contentmodel: string (nullable = true)
and once we have our final dataframe we will call show
revDf.show()
Output:
+--------------------+-------------+------------+
| star_column|contentformat|contentmodel|
+--------------------+-------------+------------+
|{{EngvarB|date=Se...| text/x-wiki| wikitext|
+--------------------+-------------+------------+

How to parse jsonfile with spark

I have a jsonfile to be parsed.The json format is like this :
{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}
I have to get every word in the file.How can I get the "major" from an array and do I have to get the word of "province" using the method df.select("cv_parse.basic_info.location.province")?
This is the result I want:
cv_id major degree birthyear state
001 English Bachelor 1984 New York
001 English Master  1984 New York

This might not be the best way of doing it but you can give it a shot.
// import the implicits functions
import org.apache.spark.sql.functions._
import sqlContext.implicits._
//read the json file
val jsonDf = sqlContext.read.json("sample-data/sample.json")
jsonDf.printSchema
Your schema would be :
root
|-- cv_id: string (nullable = true)
|-- cv_parse: struct (nullable = true)
| |-- basic_info: struct (nullable = true)
| | |-- birthyear: string (nullable = true)
| | |-- location: struct (nullable = true)
| | | |-- state: string (nullable = true)
| |-- educations: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- degree: string (nullable = true)
| | | |-- major: string (nullable = true)
Now you need can have explode the educations column
val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
$"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")
explodedResult.printSchema
Now your schema would be
root
|-- cv_id: string (nullable = true)
|-- col: struct (nullable = true)
| |-- degree: string (nullable = true)
| |-- major: string (nullable = true)
|-- birthyear: string (nullable = true)
|-- state: string (nullable = true)
Now you can select the columns
explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show
+-----+---------+--------+--------+-------+
|cv_id|birthyear| state| degree| major|
+-----+---------+--------+--------+-------+
| 001| 1984|New York|Bachelor|English|
| 001| 1984|New York| Master |English|
+-----+---------+--------+--------+-------+

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Apache spark: Write JSON DataFrame partitionBy nested columns - json

try the following code: val renamedDf = df .select(explode(col("data")) as "x" ) .select($"x.*") renamedDf.write.partitionBy("id").json(<path_to_folder>)

Related

Define a schema from a DF column in array type

Read Pyspark Struct Json Column Non Required elements

what is optimal way to parse following kafka JSON message to pyspark dataframe?

how to parse the wiki infobox json with scala spark

How to parse jsonfile with spark

Categories

Resources