Convert column with JSON string to column with dictionary in PySpark - json

I have a column with following structure in my dataframe.
+--------------------+
| data|
+--------------------+
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
+--------------------+
only showing top 5 rows
The data inside the column is a JSON string. I want to convert the column to some other type (map, struct, ...). How do I do this with a UDF? I have created a function like this, but I can't figure out what the return type should be. I tried StructType and MapType, which both threw errors. This is my code:
import json
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StructType
udf_getDict = F.udf(lambda x: json.loads(x), StructType)
subset.select(udf_getDict(F.col('data'))).printSchema()

You can use an approach with spark.read.json and df.rdd.map, like this:
json_string = """
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
"""
df2 = spark.createDataFrame(
    [
        (1, json_string),
    ],
    ['id', 'txt']
)
df2.dtypes
[('id', 'bigint'), ('txt', 'string')]
new_df = spark.read.json(df2.rdd.map(lambda r: r.txt))
new_df.printSchema()
root
|-- glossary: struct (nullable = true)
| |-- GlossDiv: struct (nullable = true)
| | |-- GlossList: struct (nullable = true)
| | | |-- GlossEntry: struct (nullable = true)
| | | | |-- Abbrev: string (nullable = true)
| | | | |-- Acronym: string (nullable = true)
| | | | |-- GlossDef: struct (nullable = true)
| | | | | |-- GlossSeeAlso: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- para: string (nullable = true)
| | | | |-- GlossSee: string (nullable = true)
| | | | |-- GlossTerm: string (nullable = true)
| | | | |-- ID: string (nullable = true)
| | | | |-- SortAs: string (nullable = true)
| | |-- title: string (nullable = true)
| |-- title: string (nullable = true)
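If you would rather avoid the RDD round-trip, from_json is another way to turn a JSON string column into a struct column. A minimal sketch against the df2 example above, assuming Spark 2.4+ for schema_of_json; in practice you would usually declare the schema by hand rather than inferring it from one sample row:
from pyspark.sql import functions as F

# Infer a DDL schema string from one sample value with schema_of_json (Spark 2.4+).
sample = df2.select('txt').first()['txt']
ddl_schema = df2.select(F.schema_of_json(F.lit(sample)).alias('s')).first()['s']

# from_json parses the JSON string column into a struct column using that schema.
parsed = df2.withColumn('parsed', F.from_json('txt', ddl_schema))
parsed.printSchema()
parsed.select('parsed.glossary.title').show(truncate=False)
The inferred DDL string can be saved and reused, so the sampling step only has to run once.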

Related

Define a schema from a DF column in array type

I have a metadata file with a column containing information about the schema of a file:
[{"column_datatype": "varchar", "column_description": "Indicates whether the Customer belongs to a particular business size, business activity, retail segment, demography, or other group and is used for reporting on regio performance regio migration.", "column_length": "4", "column_name": "clnt_grp_cd", "column_personally_identifiable_information": "False", "column_precision": "4", "column_primary_key": "True", "column_scale": null, "column_security_classifications": [], "column_sequence_number": "1"}
root
|-- column_info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column_datatype: string (nullable = true)
| | |-- column_description: string (nullable = true)
| | |-- column_length: string (nullable = true)
| | |-- column_name: string (nullable = true)
| | |-- column_personally_identifiable_information: string (nullable = true)
| | |-- column_precision: string (nullable = true)
| | |-- column_primary_key: string (nullable = true)
| | |-- column_scale: string (nullable = true)
| | |-- column_security_classifications: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- column_sequence_number: string (nullable = true)
I want to read a df using this schema. Something like:
schema = StructType([
    StructField("clnt_grp_cd", StringType(), True),
    StructField("clnt_grp_lvl1_nm", StringType(), True),
    (...)
])
df = spark.read.schema(schema).format("csv").option("header", "true").load(filenamepath)
Is there a built-in method to parse this as a schema?
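As far as I know there is no single built-in call that turns this metadata into a schema, but you can build the StructType from it yourself. A minimal sketch, assuming the metadata file has already been read into a dataframe named meta_df (a hypothetical name) with the column_info schema shown above; type_map is likewise a hypothetical lookup you would extend to cover the column_datatype values your file actually uses:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical mapping from the metadata's datatype names to Spark types.
type_map = {
    "varchar": StringType(),
    "char": StringType(),
    "integer": IntegerType(),
}

# One row per column description, ordered later by column_sequence_number.
rows = (meta_df
        .selectExpr("inline(column_info)")
        .select("column_name", "column_datatype", "column_sequence_number")
        .collect())

fields = [
    StructField(r["column_name"], type_map.get(r["column_datatype"], StringType()), True)
    for r in sorted(rows, key=lambda r: int(r["column_sequence_number"]))
]
schema = StructType(fields)

df = spark.read.schema(schema).format("csv").option("header", "true").load(filenamepath)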

parse json data with spark 2.3

I have the following JSON data:
{
    "3200": {
        "id": "3200",
        "value": [
            "cat",
            "dog"
        ]
    },
    "2000": {
        "id": "2000",
        "value": [
            "bird"
        ]
    },
    "2500": {
        "id": "2500",
        "value": [
            "kitty"
        ]
    },
    "3650": {
        "id": "3650",
        "value": [
            "horse"
        ]
    }
}
The schema of this data, printed with the printSchema utility after we load it with Spark, is as follows:
root
|-- 3200: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 2000: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 2500: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 3650: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
and I want to get the following dataframe:
id    value
3200  cat
2000  bird
2500  kitty
3200  dog
3650  horse
How can I do the parsing to get this expected output?
Using spark-sql
Dataframe step (same as in Mohana's answer)
val df = spark.read.json(Seq(jsonData).toDS())
Build a temp view
df.createOrReplaceTempView("df")
Result:
val cols_k = df.columns.map( x => s"`${x}`.id" ).mkString(",")
val cols_v = df.columns.map( x => s"`${x}`.value" ).mkString(",")
spark.sql(s"""
with t1 ( select map_from_arrays(array(${cols_k}),array(${cols_v})) s from df ),
t2 ( select explode(s) (key,value) from t1 )
select key, explode(value) value from t2
""").show(false)
+----+-----+
|key |value|
+----+-----+
|2000|bird |
|2500|kitty|
|3200|cat |
|3200|dog |
|3650|horse|
+----+-----+
You can use the stack() function to transpose the dataframe, then extract the key field and explode the value field using the explode_outer function.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, explode_outer}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val jsonData = """{
| "3200": {
| "id": "3200",
| "value": [
| "cat",
| "dog"
| ]
| },
| "2000": {
| "id": "2000",
| "value": [
| "bird"
| ]
| },
| "2500": {
| "id": "2500",
| "value": [
| "kitty"
| ]
| },
| "3650": {
| "id": "3650",
| "value": [
| "horse"
| ]
| }
|}
|""".stripMargin
val df = spark.read.json(Seq(jsonData).toDS())
df.selectExpr("stack (4, *) key")
  .select(expr("key.id").as("key"),
          explode_outer(expr("key.value")).as("value"))
  .show(false)
+----+-----+
|key |value|
+----+-----+
|2000|bird |
|2500|kitty|
|3200|cat |
|3200|dog |
|3650|horse|
+----+-----+
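Since the original question on this page is tagged pyspark, here is a rough Python equivalent of the stack() approach above; it is only a sketch and assumes jsonData holds the same JSON string as in the Scala snippet:
from pyspark.sql import functions as F

# Read the JSON string through an RDD of strings (same idea as Seq(...).toDS in Scala).
df = spark.read.json(spark.sparkContext.parallelize([jsonData]))

# stack(4, *) turns the four top-level struct columns into four rows of one column,
# then id and the exploded value array are pulled out of each struct.
result = (df
          .selectExpr("stack (4, *) key")
          .select(F.expr("key.id").alias("key"),
                  F.explode_outer(F.expr("key.value")).alias("value")))
result.show(truncate=False)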

Traversing through the Json object

I have a json file which has the following data:
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": [
                            "GML",
                            "XML"
                        ]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
I need to read this file in PySpark and traverse through all the elements in the JSON. I need to recognize all the struct, array, and array-of-struct columns and create separate Hive tables for each struct and array column.
For example:
Glossary will be one table with "title" as the column.
GlossEntry will be another table with the columns "ID", "SortAs", "GlossTerm", "acronym", "abbrev".
The data will grow in the future with more nested structures, so I will have to write generalized code that traverses through all the JSON elements and recognizes all the struct and array columns.
Is there a way to loop through every element in the nested struct?
Spark is able to automatically parse and infer the JSON schema. Once it is in the Spark dataframe, you can access elements within the JSON by specifying their path.
json_df = spark.read.json(filepath)
json_df.printSchema()
Output:
root
|-- glossary: struct (nullable = true)
| |-- GlossDiv: struct (nullable = true)
| | |-- GlossList: struct (nullable = true)
| | | |-- GlossEntry: struct (nullable = true)
| | | | |-- Abbrev: string (nullable = true)
| | | | |-- Acronym: string (nullable = true)
| | | | |-- GlossDef: struct (nullable = true)
| | | | | |-- GlossSeeAlso: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- para: string (nullable = true)
| | | | |-- GlossSee: string (nullable = true)
| | | | |-- GlossTerm: string (nullable = true)
| | | | |-- ID: string (nullable = true)
| | | | |-- SortAs: string (nullable = true)
| | |-- title: string (nullable = true)
| |-- title: string (nullable = true)
Then choose the fields to extract:
json_df.select("glossary.title").show()
json_df.select("glossary.GlossDiv.GlossList.GlossEntry.*").select("Abbrev","Acronym","ID","SortAs").show()
Extracted output:
+----------------+
| title|
+----------------+
|example glossary|
+----------------+
+-------------+-------+----+------+
| Abbrev|Acronym| ID|SortAs|
+-------------+-------+----+------+
|ISO 8879:1986| SGML|SGML| SGML|
+-------------+-------+----+------+
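For the "loop through every element" part of the question, the inferred schema itself can be walked recursively. A minimal sketch, assuming the json_df from above: it prints the full path and type of every field, and you could instead collect the struct and array paths from the same walk to drive your table creation:
from pyspark.sql.types import ArrayType, StructType

def walk(schema, prefix=""):
    # Recursively visit every field of a (possibly nested) schema.
    for field in schema.fields:
        path = prefix + field.name
        dtype = field.dataType
        print(path, "->", dtype.simpleString())
        if isinstance(dtype, StructType):
            walk(dtype, path + ".")
        elif isinstance(dtype, ArrayType) and isinstance(dtype.elementType, StructType):
            walk(dtype.elementType, path + ".")

walk(json_df.schema)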

Apache spark: Write JSON DataFrame partitionBy nested columns

I have this kind of JSON data:
{
    "data": [
        {
            "id": "4619623",
            "team": "452144",
            "created_on": "2018-10-09 02:55:51",
            "links": {
                "edit": "https://some_page",
                "publish": "https://some_publish",
                "default": "https://some_default"
            }
        },
        {
            "id": "4619600",
            "team": "452144",
            "created_on": "2018-10-09 02:42:25",
            "links": {
                "edit": "https://some_page",
                "publish": "https://some_publish",
                "default": "https://some_default"
            }
        }
    ]
}
I read this data using Apache Spark and I want to write it partitioned by the id column. When I use this:
df.write.partitionBy("data.id").json(<path_to_folder>)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Partition column data.id not found in schema
I also tried to use the explode function, like this:
import org.apache.spark.sql.functions.{col, explode}
val renamedDf = df.withColumn("id", explode(col("data.id")))
renamedDf.write.partitionBy("id").json(<path_to_folder>)
That actually helped, but each id partition folder contained the same original JSON file.
EDIT: schema of df DataFrame:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
Schema of renamedDf DataFrame:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
|-- id: string (nullable = true)
I am using Spark 2.1.0.
I found this solution: DataFrame partitionBy on nested columns
And this example: http://bigdatums.net/2016/02/12/how-to-extract-nested-json-data-in-spark/
But none of this helped me solve my problem.
Thanks in advance for any help.
Try the following code:
val renamedDf = df
  .select(explode(col("data")) as "x")
  .select($"x.*")
renamedDf.write.partitionBy("id").json(<path_to_folder>)
You are just missing a select statement after the initial explode:
val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("/FileStore/tables/test.json")
df.printSchema
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
| | |-- team: string (nullable = true)
import org.apache.spark.sql.functions.{col, explode}
val df1= df.withColumn("data", explode(col("data")))
df1.printSchema
root
|-- data: struct (nullable = true)
| |-- created_on: string (nullable = true)
| |-- id: string (nullable = true)
| |-- links: struct (nullable = true)
| | |-- default: string (nullable = true)
| | |-- edit: string (nullable = true)
| | |-- publish: string (nullable = true)
| |-- team: string (nullable = true)
val df2 = df1.select("data.created_on","data.id","data.team","data.links")
df2.show
+-------------------+-------+------+--------------------+
| created_on| id| team| links|
+-------------------+-------+------+--------------------+
|2018-10-09 02:55:51|4619623|452144|[https://some_def...|
|2018-10-09 02:42:25|4619600|452144|[https://some_def...|
+-------------------+-------+------+--------------------+
df2.write.partitionBy("id").json("/FileStore/tables/test_part.json")
val f = spark.read.json("/FileStore/tables/test_part.json/id=4619600")
f.show
+-------------------+--------------------+------+
| created_on| links| team|
+-------------------+--------------------+------+
|2018-10-09 02:42:25|[https://some_def...|452144|
+-------------------+--------------------+------+
val full = spark.read.json("/FileStore/tables/test_part.json")
full.show
+-------------------+--------------------+------+-------+
| created_on| links| team| id|
+-------------------+--------------------+------+-------+
|2018-10-09 02:55:51|[https://some_def...|452144|4619623|
|2018-10-09 02:42:25|[https://some_def...|452144|4619600|
+-------------------+--------------------+------+-------+
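For reference, the same explode-then-flatten pattern in PySpark, since the original question on this page is Python-oriented; this is only a sketch and assumes the same test.json input and output paths as above:
from pyspark.sql import functions as F

df = spark.read.option("multiLine", True).option("mode", "PERMISSIVE") \
    .json("/FileStore/tables/test.json")

# Explode the array, then promote the struct fields to top-level columns
# so that partitionBy can see the id column.
flat = df.select(F.explode("data").alias("x")).select("x.*")
flat.write.partitionBy("id").json("/FileStore/tables/test_part.json")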

What is the optimal way to parse the following Kafka JSON message into a PySpark dataframe?

I'm using Spark Structured Streaming to read a Kafka topic and want to convert the following complex JSON (the Kafka message) into a dataframe with the columns NAME, ADDRESS, DESCRIPTION, CODE, DEPARTMENT, INFA_OP_TYPE, DTL__CAPXTIMESTAMP.
{
    "meta_data": [
        {"name": {"string": "INFA_SEQUENCE"}, "value": {"string": "2,PWX_GENERIC"}, "type": null},
        {"name": {"string": "INFA_TABLE_NAME"}, "value": {"string": "customers"}, "type": null},
        {"name": {"string": "INFA_OP_TYPE"}, "value": {"string": "INSERT_EVENT"}, "type": null},
        {"name": {"string": "DTL__CAPXRESTART1"}, "value": {"string": "B+IABwAfA"}, "type": null},
        {"name": {"string": "DTL__CAPXRESTART2"}, "value": {"string": "AAABpMwgRDk="}, "type": null},
        {"name": {"string": "DTL__CAPXUOW"}, "value": {"string": "AAMKPgAAqaIABg=="}, "type": null},
        {"name": {"string": "DTL__CAPXUSER"}, "value": null, "type": null},
        {"name": {"string": "DTL__CAPXTIMESTAMP"}, "value": {"string": "201807310934257270000000"}, "type": null},
        {"name": {"string": "DTL__CAPXACTION"}, "value": {"string": "I"}, "type": null}
    ],
    "columns": {
        "array": [
            {"name": {"string": "NAME"}, "value": {"string": "ABCD"}, "isPresent": {"boolean": true}},
            {"name": {"string": "ADDRESS"}, "value": {"string": "123,Bark street"}, "isPresent": {"boolean": true}},
            {"name": {"string": "DESCRIPTION"}, "value": {"string": "Canadian"}, "isPresent": {"boolean": true}},
            {"name": {"string": "CODE"}, "value": {"string": "3_1"}, "isPresent": {"boolean": true}},
            {"name": {"string": "DEPARTMENT"}, "value": {"string": "HR"}, "isPresent": {"boolean": true}}
        ]
    }
}
I'm able to extract the two JSON objects "meta_data" and "columns", but I'm unable to explode "columns.array".
newJsonObj = events.select(
    get_json_object(events.value, '$.meta_data').alias('meta_data'),
    get_json_object(events.value, '$.columns.array').alias('columns')
)
And I don't know how to extract values from the two JSON objects and create a dataframe with columns from both.
-- Schema of events dataframe --
root
|-- columns: struct (nullable = true)
| |-- array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- isPresent: struct (nullable = true)
| | | | |-- boolean: boolean (nullable = true)
| | | |-- name: struct (nullable = true)
| | | | |-- string: string (nullable = true)
| | | |-- value: struct (nullable = true)
| | | | |-- string: string (nullable = true)
|-- meta_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: struct (nullable = true)
| | | |-- string: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- value: struct (nullable = true)
| | | |-- string: string (nullable = true)
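One way to get those columns without a UDF, sketched under the assumption that events already carries the struct schema printed above (for example, the Kafka value parsed with from_json) and that Spark 2.4+ is available for transform and map_from_entries: turn each array of name/value structs into a map, then look up the wanted keys as columns.
from pyspark.sql import functions as F

# Build one map per row from the name/value pairs in columns.array and meta_data.
with_maps = events.select(
    F.map_from_entries(
        F.expr("transform(columns.array, c -> struct(c.name.string, c.value.string))")
    ).alias("col_map"),
    F.map_from_entries(
        F.expr("transform(meta_data, m -> struct(m.name.string, m.value.string))")
    ).alias("meta_map"),
)

wanted_cols = ["NAME", "ADDRESS", "DESCRIPTION", "CODE", "DEPARTMENT"]
wanted_meta = ["INFA_OP_TYPE", "DTL__CAPXTIMESTAMP"]

# Look each wanted key up in its map and expose it as a top-level column;
# result can then be written out with writeStream in the streaming job.
result = with_maps.select(
    *[F.col("col_map")[k].alias(k) for k in wanted_cols],
    *[F.col("meta_map")[k].alias(k) for k in wanted_meta],
)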