Flatten a json column containing multiple comma separated json in spark dataframe - json

In my Spark dataframe I have a column that contains a single JSON value made up of multiple comma-separated JSON objects, each holding a key/value pair. I need to flatten the JSON data into separate columns.
A record of the JSON column student_data looks like below:
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
|id|name |student_data |
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
|11|stephy|{{"key":"hindi","value":{"hindi_mythology":80}},{"key":"social_science","value":{"civics":65}},{"key":"maths","value":{"geometry":70}}}|
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
The schema of the record is as below.
root
|-- id : int
|-- name : string
|-- student_data : string
The requirement is to flatten the JSON; the expected output is as below.
+---+------+-----+--------------+-----+
|id |name  |hindi|social_science|maths|
+---+------+-----+--------------+-----+
|11 |stephy|80   |65            |70   |
+---+------+-----+--------------+-----+

You can transform your JSON into a struct type with the Spark function from_json(), using a schema that represents the structure of the JSON string. After that, to get the expected result, you can pivot the column to go from rows into column format:
The input JSON file:
{
"id": 11,
"name": "stephy",
"student_data": "[{\"key\":\"hindi\",\"value\":{\"hindi_mythology\":80}},{\"key\":\"social_science\",\"value\":{\"civics\":65}},{\"key\":\"maths\",\"value\":{\"geometry\":70}}]"
}
Code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = spark.read.json("file.json")

val schema = new StructType()
  .add("key", StringType, true)
  .add("value", MapType(StringType, IntegerType), true)

val res = df
  .withColumn("student_data", from_json(col("student_data"), ArrayType(schema)))
  .select(col("id"), col("name"), explode(col("student_data")).as("student_data"))
  .select("id", "name", "student_data.*")
  .select(col("id"), col("name"), col("key"), map_values(col("value")).getItem(0).as("value"))

res.groupBy("id", "name").pivot("key").agg(first(col("value"))).show(false)
+---+------+-----+-----+--------------+
|id |name |hindi|maths|social_science|
+---+------+-----+-----+--------------+
|11 |stephy|80 |70 |65 |
+---+------+-----+-----+--------------+
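As a side note, if the set of subject keys is known up front, you can pass it to pivot() so Spark skips the extra job that computes the distinct key values. A small sketch reusing res from above; the key list is taken from the sample data:
res.groupBy("id", "name")
  .pivot("key", Seq("hindi", "social_science", "maths"))
  .agg(first(col("value")))
  .show(false)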

Related

Flattening json string in spark

I have the following dataframe in spark:
root
|-- user_id: string (nullable = true)
|-- payload: string (nullable = true)
in which payload is a JSON string with no fixed schema. Here is some sample data:
{'user_id': '001','payload': '{"country":"US","time":"11111"}'}
{'user_id': '002','payload': '{"message_id":"8936716"}'}
{'user_id': '003','payload': '{"brand":"adidas","when":""}'}
I want to output the above data in JSON format with the payload flattened (basically just extracting the key/value pairs from payload and putting them at the root level), for example:
{'user_id': '001','country':'US','time':'11111'}
{'user_id': '002','message_id':'8936716'}
{'user_id': '003','brand':'adidas','when':''}
Stack Overflow flagged this as a duplicate of Flatten Nested Spark Dataframe, but it's not.
The difference here is that the value of payload in my case is just string type.
You can parse the payload JSON as a map<string,string> and add the user_id to the payload:
import pyspark.sql.functions as F
# input dataframe
df.show(truncate=False)
+-------+-------------------------------+
|user_id|payload |
+-------+-------------------------------+
|001 |{"country":"US","time":"11111"}|
|002 |{"message_id":"8936716"} |
|003 |{"brand":"adidas","when":""} |
+-------+-------------------------------+
df2 = df.select(
    F.to_json(
        F.map_concat(
            F.create_map(F.lit('user_id'), F.col('user_id')),
            F.from_json('payload', 'map<string,string>')
        )
    ).alias('out')
)
df2.show(truncate=False)
+-----------------------------------------------+
|out |
+-----------------------------------------------+
|{"user_id":"001","country":"US","time":"11111"}|
|{"user_id":"002","message_id":"8936716"} |
|{"user_id":"003","brand":"adidas","when":""} |
+-----------------------------------------------+
To write it to a JSON file, you can do:
df2.coalesce(1).write.text('filepath')
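For reference, a rough Scala sketch of the same map_concat idea (assuming a df with the same user_id and payload columns; map_concat needs Spark 2.4+):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Prepend user_id as a one-entry map, then merge in the parsed payload map.
val df2 = df.select(
  to_json(
    map_concat(
      map(lit("user_id"), col("user_id")),
      from_json(col("payload"), MapType(StringType, StringType))
    )
  ).alias("out")
)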
This is how I finally solved the problem:
from pyspark.sql.functions import col, from_json

# Infer the payload schema by reading the payload strings themselves as JSON
json_schema = spark.read.json(source_parquet_df.rdd.map(lambda row: row.payload)).schema
new_df = source_parquet_df.withColumn('payload_json_obj', from_json(col('payload'), json_schema)).drop(source_parquet_df.payload)
flat_df = new_df.select([c for c in new_df.columns if c != 'payload_json_obj'] + ['payload_json_obj.*'])
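The same schema-inference trick in Scala would look roughly like this (a sketch; sourceParquetDf and the payload column mirror the Python above, and clashing top-level/nested column names would need care):
import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

// Infer the schema by reading the payload strings themselves as JSON.
val jsonSchema = spark.read.json(sourceParquetDf.select("payload").as[String]).schema

val flatDf = sourceParquetDf
  .withColumn("payload_json_obj", from_json(col("payload"), jsonSchema))
  .select(col("*"), col("payload_json_obj.*"))
  .drop("payload", "payload_json_obj")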

How to parse json string to different columns in spark scala?

While reading a parquet file, this is the file data:
+---+-----+--------------------------------------------------------------------------------------------+
|id |name |activegroup                                                                                 |
+---+-----+--------------------------------------------------------------------------------------------+
|1  |abc  |[{"groupID":"5d","role":"admin","status":"A"},{"groupID":"58","role":"admin","status":"A"}] |
+---+-----+--------------------------------------------------------------------------------------------+
Data types of each field:
root
|--id : int
|--name : String
|--activegroup : String
The activegroup column is a string, so the explode function is not working on it directly. Following is the required output:
+---+-----+-------+-----+------+
|id |name |groupID|role |status|
+---+-----+-------+-----+------+
|1  |abc  |5d     |admin|A     |
|1  |abc  |58     |admin|A     |
+---+-----+-------+-----+------+
Please help me parse the above in the latest version of Spark Scala.
First you need to extract the JSON schema:
val schema = schema_of_json(lit(df.select($"activegroup").as[String].first))
Once you have it, you can convert your activegroup column, which is a String, to JSON (from_json), and then explode it.
Once the column is JSON, you can extract its values with $"columnName.field".
val dfresult = df.withColumn("jsonColumn", explode(
    from_json($"activegroup", schema)))
  .select($"id", $"name",
    $"jsonColumn.groupId" as "groupId",
    $"jsonColumn.role" as "role",
    $"jsonColumn.status" as "status")
If you want to extract the whole JSON and the element names are fine for you, you can use * to do it:
val dfresult = df.withColumn("jsonColumn", explode(
    from_json($"activegroup", schema)))
  .select($"id", $"name", $"jsonColumn.*")
RESULT
+---+----+-------+-----+------+
| id|name|groupId| role|status|
+---+----+-------+-----+------+
| 1| abc| 5d|admin| A|
| 1| abc| 58|admin| A|
+---+----+-------+-----+------+
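If you'd rather not infer the schema with schema_of_json (which needs Spark 2.4+), the struct can also be declared by hand. A sketch, with field names taken from the sample data above:
import org.apache.spark.sql.types._

val explicitSchema = ArrayType(new StructType()
  .add("groupID", StringType)
  .add("role", StringType)
  .add("status", StringType))

val dfresult = df.withColumn("jsonColumn",
    explode(from_json($"activegroup", explicitSchema)))
  .select($"id", $"name", $"jsonColumn.*")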

pyspark convert row to json with nulls

Goal:
For a dataframe with schema
id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string
I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions.
My code:
import pyspark.sql.functions as func

df = df.withColumn("JSON", func.to_json(func.struct([df[x] for x in df.columns])))
I am having one issue:
Issue:
When any row has a null value for a column (and my data has many...), the JSON string doesn't contain the key. I.e. if only 9 out of the 27 columns have values, then the JSON string only has 9 keys. What I would like to do is keep all keys, but for the null values just pass an empty string "".
Any tips?
You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.
Consider the following example DataFrame:
data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]
sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)
Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.
from pyspark.sql.functions import col, to_json, struct, when, lit
sdf = sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [
                when(
                    col(x).isNotNull(),
                    col(x)
                ).otherwise(lit("")).alias(x)
                for x in sdf.columns
            ]
        )
    )
)
sdf.show()
#+-----+----+---+-----------------------------+
#|A |B |C |JSON |
#+-----+----+---+-----------------------------+
#|one |1 |10 |{"A":"one","B":"1","C":"10"} |
#|null |2 |20 |{"A":"","B":"2","C":"20"} |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"} |
#+-----+----+---+-----------------------------+
Another option is to use pyspark.sql.functions.coalesce instead of when:
from pyspark.sql.functions import coalesce
sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]
        )
    )
).show(truncate=False)
## Same as above

from_json of Spark SQL returns null values

I loaded a parquet file into a Spark dataframe as follows:
val message = spark.read.parquet("gs://defenault-zdtt-devde/pubsub/part-00001-e9f8c58f-7de0-4537-a7be-a9a8556sede04a-c000.snappy.parquet")
When I perform a collect on my dataframe, I get the following result:
message.collect()
Array[org.apache.spark.sql.Row] = Array([118738748835150,2018-08-20T17:44:38.742Z,{"id":"uplink-3130-85bc","device_id":60517119992794222,"group_id":69,"group":"box-2478-2555","profile_id":3,"profile":"eolane-movee","type":"uplink","timestamp":"2018-08-20T17:44:37.048Z","count":3130,"payload":[{"timestamp":"2018-08-20T17:44:37.048Z","data":{"battery":3.5975599999999996,"temperature":27}}],"payload_encrypted":"9da25e36","payload_cleartext":"fe1b01aa","device_properties":{"appeui":"7ca97df000001190","deveui":"7ca97d0000001bb0","external_id":"Product: 3.7 / HW: 3.1 / SW: 1.8.8","no_de_serie_eolane":"4904","no_emballage":"S02066","product_version":"1.3.1"},"protocol_data":{"AppNonce":"e820ef","DevAddr":"0e6c5fda","DevNonce":"85bc","NetID":"000007","best_gateway_id":"M40246","gateway.
The schema of this dataframe is
message.printSchema()
root
|-- Id: string (nullable = true)
|-- publishTime: string (nullable = true)
|-- data: string (nullable = true)
My aim is to work on the data column, which holds JSON data, and to flatten it.
I wrote the following code:
import org.apache.spark.sql.types._

val schemaTotal = new StructType(Array(
  StructField("id", StringType, false), StructField("device_id", StringType), StructField("group_id", LongType),
  StructField("group", StringType), StructField("profile_id", IntegerType), StructField("profile", StringType),
  StructField("type", StringType), StructField("timestamp", StringType), StructField("count", StringType),
  StructField("payload", new StructType()
    .add("timestamp", StringType)
    .add("data", new ArrayType(new StructType().add("battery", LongType).add("temperature", LongType), false))),
  StructField("payload_encrypted", StringType), StructField("payload_cleartext", StringType),
  StructField("device_properties", new ArrayType(new StructType()
    .add("appeui", StringType).add("deveui", StringType).add("external_id", StringType)
    .add("no_de_serie_eolane", LongType).add("no_emballage", StringType).add("product_version", StringType), false)),
  StructField("protocol_data", new ArrayType(new StructType()
    .add("AppNonce", StringType).add("DevAddr", StringType).add("DevNonce", StringType)
    .add("NetID", LongType).add("best_gateway_id", StringType).add("gateways", IntegerType)
    .add("lora_version", IntegerType).add("noise", LongType).add("port", IntegerType)
    .add("rssi", DoubleType).add("sf", IntegerType).add("signal", DoubleType).add("snr", DoubleType), false)),
  StructField("lat", StringType), StructField("lng", StringType),
  StructField("geolocation_type", StringType), StructField("geolocation_precision", StringType),
  StructField("delivered_at", StringType)))
val dataframe_extract = message.select($"Id",
  $"publishTime",
  from_json($"data", schemaTotal).as("content"))
val table = dataframe_extract.select(
$"Id",
$"publishTime",
$"content.id" as "id",
$"content.device_id" as "device_id",
$"content.group_id" as "group_id",
$"content.group" as "group",
$"content.profile_id" as "profile_id",
$"content.profile" as "profile",
$"content.type" as "type",
$"content.timestamp" as "timestamp",
$"content.count" as "count",
$"content.payload.timestamp" as "timestamp2",
$"content.payload.data.battery" as "battery",
$"content.payload.data.temperature" as "temperature",
$"content.payload_encrypted" as "payload_encrypted",
$"content.payload_cleartext" as "payload_cleartext",
$"content.device_properties.appeui" as "appeui"
)
table.show() gives me null values for all columns:
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
| Id| publishTime| id|device_id|group_id|group|profile_id|profile|type|timestamp|count|timestamp2|battery|temperature|payload_encrypted|payload_cleartext|appeui|
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
|118738748835150|2018-08-20T17:44:...|null| null| null| null| null| null|null| null| null| null| null| null| null| null| null|
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
whereas table.printSchema() gives me the expected result. Any idea how to solve this, please?
I am working with Zeppelin as a first prototyping step. Thanks a lot in advance for your help.
Best regards
The from_json() SQL function has the following constraint to be followed when converting a column value to a dataframe:
the datatypes you define in the schema must match the values present in the JSON; if any column's value is mismatched, it leads to null in all column values.
e.g., for the column value
'{"name": "raj", "age": 12}'
the schema
StructType(List(StructField(name,StringType,true),StructField(age,StringType,true)))
will return null values in both columns, whereas
StructType(List(StructField(name,StringType,true),StructField(age,IntegerType,true)))
will return the expected dataframe.
For this thread, that could be the reason: if any mismatched column value is present, from_json will return null for all column values.
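A minimal sketch to reproduce the behaviour described above (assuming a SparkSession with implicits imported; exact null-handling can vary slightly across Spark versions):
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val sample = Seq("""{"name": "raj", "age": 12}""").toDF("value")

// age declared as a string while the JSON holds a number: the mismatch
// is expected to null out the parsed row, per the answer above.
val mismatched = new StructType().add("name", StringType).add("age", StringType)

// age declared as an integer, matching the JSON value.
val matching = new StructType().add("name", StringType).add("age", IntegerType)

sample.select(from_json($"value", mismatched).as("parsed")).show(false)
sample.select(from_json($"value", matching).as("parsed")).show(false)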

Fit a json string to a DataFrame using a schema

I have a schema that looks like this:
StructType(StructField(keys,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
I have a JSON string (that matches this schema) that I need to convert to fit the above schema:
"{"keys" : [2.0, 1.0]}"
How do I proceed to get a DataFrame out of this string that matches my schema?
Following are the steps I have tried in a Scala notebook:
val rddData2 = sc.parallelize("""{"keys" : [1.0 , 2.0] }""" :: Nil)
val in = session.read.schema(schema).json(rddData2)
in.show
This is the output being shown:
+-----------+
|keys |
+-----------+
|null |
+-----------+
If you have a JSON string as
val jsonString = """{"keys" : [2.0, 1.0]}"""
then you can create a dataframe without a schema as
val jsonRdd = sc.parallelize(Seq(jsonString))
val df = sqlContext.read.json(jsonRdd)
which should give you
+----------+
|keys |
+----------+
|[2.0, 1.0]|
+----------+
with schema
root
|-- keys: array (nullable = true)
| |-- element: double (containsNull = true)
Now if you want to convert the array column created by default to a Vector, you would need a udf function as
import org.apache.spark.sql.functions._
def vectorUdf = udf((array: collection.mutable.WrappedArray[Double]) => org.apache.spark.ml.linalg.Vectors.dense(Array(array: _*)))
and call the udf function using .withColumn as
df.withColumn("keys", vectorUdf(col("keys")))
You should get a dataframe with the schema
root
|-- keys: vector (nullable = true)
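As a side note, on Spark 3.1+ the same array-to-vector conversion ships as a built-in function, so the hand-written udf can be skipped. A sketch against the df created above:
import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.sql.functions.col

df.withColumn("keys", array_to_vector(col("keys")))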