Merge JSON files by key using PySpark

The JSON file has the following format:
**Input-**
{'key-a' : [{'key1':'value1', 'key2':'value2'},{'key1':'value3', 'key2':'value4'}...],
'key-b':'value-b',
'key-c':'value-c'},
{'key-a' : [{'key1':'value5', 'key2':'value6'},{'key1':'value7', 'key2':'value8'}...],
'key-b':'value-b',
'key-c':'value-c'}
I need to merge all the values of 'key-a' across the records and return a single JSON object as output:
**Output-**
{'key-a' :
[{'key1':'value1', 'key2':'value2'},
{'key1':'value3', 'key2':'value4'},
{'key1':'value5', 'key2':'value6'},
{'key1':'value7', 'key2':'value8'}...],
'key-b':'value-b',
'key-c':'value-c'}
The data is loaded into a PySpark DataFrame with the following schema:
**Schema:**
key-a
|-- key1: string (nullable= false)
|-- key2: string (nullable= true)
key-b: string (nullable= true)
key-c: string (nullable= false)
I have tried using the groupbykey function but when I try to Show() the output, I get the following error: "groupeddata object has no attribute 'show' pyspark".
How can I achieve the above transformation?

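This can be a working solution for you:
# Imports used throughout this answer (an active SparkSession named `spark` is assumed)
from pyspark.sql import functions as F, types as T
# Create the dataframe here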
df_new = spark.createDataFrame([(str({"key-a":[{"key1":"value1","key2":"value2"}, {"key1": "value3", "key2": "value4"}], "key-b" :"value-b"})), (str({"key-a":[{"key1":"value5","key2":"value6"}, {"key1": "value7", "key2": "value8"}], "key-b" :"value-b"}))],T.StringType())
df_new.show(truncate=False)
+-----------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------+
|{'key-a': [{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}], 'key-b': 'value-b'}|
|{'key-a': [{'key1': 'value5', 'key2': 'value6'}, {'key1': 'value7', 'key2': 'value8'}], 'key-b': 'value-b'}|
+-----------------------------------------------------------------------------------------------------------+
Use from_json with a suitable schema to evaluate the column first.
The idea here is to get the JSON keys into a column and then use groupBy:
df = df_new.withColumn('col', F.from_json("value",T.MapType(T.StringType(), T.StringType())))
df = df.select("col", F.explode("col").alias("x", "y"))
df.select("x", "y").show(truncate=False)
+-----+---------------------------------------------------------------------+
|x |y |
+-----+---------------------------------------------------------------------+
|key-a|[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]|
|key-b|value-b |
|key-a|[{"key1":"value5","key2":"value6"},{"key1":"value7","key2":"value8"}]|
|key-b|value-b |
+-----+---------------------------------------------------------------------+
Logic:
Group by the key column and collect the values, then add a dummy constant column so that everything can be grouped into a single row.
df_grp = df.groupBy("x").agg(F.collect_set("y").alias("y"))
df_grp = df_grp.withColumn("y", F.col("y").cast(T.StringType()))
df_grp = df_grp.withColumn("array", F.array("x", "y"))
df_grp = df_grp.withColumn("dummy_col", F.lit("1"))
df_grp = df_grp.groupBy("dummy_col").agg(F.collect_set("array"))
df_grp.show(truncate=False)
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|dummy_col|collect_set(array) |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[[key-a, [[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}], [{"key1":"value5","key2":"value6"},{"key1":"value7","key2":"value8"}]]], [key-b, [value-b]]]|
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I have tried using the groupbykey function but when I try to Show() the output, I get the following error: "groupeddata object has no attribute 'show' pyspark".
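This was giving trouble because groupBy returns a GroupedData object, not a DataFrame, so it has no show() method. You need to apply an aggregate function (for example through agg()) to get a DataFrame back before calling show().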
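The merge the question asks for (concatenate every 'key-a' list for records that share 'key-b' and 'key-c') can be sketched in plain Python, without Spark, to make the grouping-plus-aggregation idea concrete. The helper name merge_by_keys and its parameters are made up for this illustration.

```python
import json
from collections import defaultdict


def merge_by_keys(records, list_key="key-a", group_keys=("key-b", "key-c")):
    """Concatenate the list stored under list_key for records sharing group_keys."""
    merged = defaultdict(list)
    for rec in records:
        # The grouping key is the tuple of the scalar fields
        group = tuple(rec.get(k) for k in group_keys)
        # The "aggregation" step: append this record's items to the group's list
        merged[group].extend(rec[list_key])
    return [
        {list_key: items, **dict(zip(group_keys, group))}
        for group, items in merged.items()
    ]


records = [
    {"key-a": [{"key1": "value1", "key2": "value2"}, {"key1": "value3", "key2": "value4"}],
     "key-b": "value-b", "key-c": "value-c"},
    {"key-a": [{"key1": "value5", "key2": "value6"}, {"key1": "value7", "key2": "value8"}],
     "key-b": "value-b", "key-c": "value-c"},
]

result = merge_by_keys(records)
print(json.dumps(result[0], indent=2))
```

In PySpark the same shape would typically be written as df.groupBy("key-b", "key-c").agg(F.flatten(F.collect_list("key-a")).alias("key-a")), assuming Spark 2.4 or later for flatten. Note that collect_list does not guarantee element order across partitions.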

Related

In Spark SQL, transform JSON key name into value

It seems like there should be a function for this in Spark SQL similar to pivoting, but I haven't found any solution for transforming a JSON key into a value. Suppose I have a badly formed JSON (the format of which I cannot change):
{"A long string containing serverA": {"x": 1, "y": 2}}
how can I process it to
{"server": "A", "x": 1, "y": 2}
?
I read the JSONs into an sql.DataFrame and would then like to process them as described above:
val cs = spark.read.json("sample.json")
.???
If we want to use only Spark functions and no UDFs, we can use from_json to parse the JSON into a map (we need to specify a schema). Then we just need to extract the information with Spark functions.
One way to do it is as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // enables the 'column symbol syntax

// Map from the arbitrary top-level key to a struct holding x and y
val schema = MapType(
  StringType,
  StructType(Array(
    StructField("x", IntegerType),
    StructField("y", IntegerType)
  ))
)
spark.read.text("...")
  .withColumn("json", from_json('value, schema))
  .withColumn("key", map_keys('json).getItem(0))
  .withColumn("value", map_values('json).getItem(0))
  .withColumn("server",
    // Extracting the server name with a regex
    regexp_replace(regexp_extract('key, "server[^ ]*", 0), "server", ""))
  .select("server", "value.*")
  .show(false)
which yields:
+------+---+---+
|server|x |y |
+------+---+---+
|A |1 |2 |
+------+---+---+
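The same key-to-value step can be illustrated in plain Python, which may help when checking the regex before running it on a cluster. This is only a sketch: the function name flatten_server_record and the pattern are assumptions for the example, not part of any Spark API.

```python
import json
import re


def flatten_server_record(raw, pattern=r"server(\S*)"):
    """Turn {"... serverA": {"x": 1, "y": 2}} into {"server": "A", "x": 1, "y": 2}."""
    obj = json.loads(raw)
    # The badly formed document has exactly one top-level key
    key, values = next(iter(obj.items()))
    # Capture whatever follows the literal "server" up to the next whitespace
    match = re.search(pattern, key)
    server = match.group(1) if match else None
    return {"server": server, **values}


row = flatten_server_record('{"A long string containing serverA": {"x": 1, "y": 2}}')
print(row)
```

Using a capture group avoids the extract-then-replace pair needed in the Spark version, because the prefix is dropped in the same match.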

How to add field within nested JSON when reading from/writing to Kafka via a Spark dataframe

I've a Spark (v.3.0.1) job written in Java that reads Json from Kafka, does some transformation and then writes it back to Kafka. For now, the incoming message structure in Kafka is something like:
{"catKey": 1}. The output from the Spark job that's written back to Kafka is something like: {"catKey":1,"catVal":"category-1"}. The code for processing input data from Kafka goes something as follows:
DataFrameReader dfr = putSrcProps(spark.read().format("kafka"));
for (String key : srcProps.stringPropertyNames()) {
    dfr = dfr.option(key, srcProps.getProperty(key));
}
Dataset<Row> df = dfr.option("group.id", getConsumerGroupId())
    .load()
    .selectExpr("CAST(value AS STRING) as value")
    .withColumn("jsonData", from_json(col("value"), schemaHandler.getSchema()))
    .select("jsonData.*");
// transform df
df.toJSON().write().format("kafka").option("key", "val").save();
I want to change the message structure in Kafka. It should now have the format: {"metadata": <whatever>, "payload": {"catKey": 1}}. While reading, we need to read only the contents of the payload, so the dataframe remains similar. While writing back to Kafka, I first need to wrap the message in payload and add the metadata. The output must have the format: {"metadata": <whatever>, "payload": {"catKey":1,"catVal":"category-1"}}. I've tried manipulating the contents of the selectExpr and from_json calls, but no luck so far. Any pointers on how to achieve this would be very much appreciated.
To extract the content of payload in your JSON you can use get_json_object. And to create the new output you can use the built-in functions struct and to_json.
Given a Dataframe:
val df = Seq(("""{"metadata": "whatever", "payload": {"catKey": 1}}""")).toDF("value").as[String]
df.show(false)
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|{"metadata": "whatever", "payload": {"catKey": 1}}|
+--------------------------------------------------+
Then build the "metadata" and "payload" columns, and finally serialize them into a new column called "value":
val df2 = df
  .withColumn("catVal", lit("category-1")) // whatever your logic is to fill this column
  .withColumn("payload",
    struct(
      get_json_object(col("value"), "$.payload.catKey").as("catKey"),
      col("catVal").as("catVal")
    )
  )
  .withColumn("metadata",
    get_json_object(col("value"), "$.metadata")
  ).select("metadata", "payload")
df2.show(false)
+--------+---------------+
|metadata|payload |
+--------+---------------+
|whatever|[1, category-1]|
+--------+---------------+
val df3 = df2.select(to_json(struct(col("metadata"), col("payload"))).as("value"))
df3.show(false)
+----------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------+
|{"metadata":"whatever","payload":{"catKey":"1","catVal":"category-1"}}|
+----------------------------------------------------------------------+
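The unwrap, transform, re-wrap flow can also be sketched outside Spark in plain Python. The helper names unwrap and wrap, and the rule used to fill catVal, are placeholders invented for this illustration.

```python
import json


def unwrap(message):
    """Split an envelope into its metadata and its payload dict."""
    envelope = json.loads(message)
    return envelope.get("metadata"), envelope["payload"]


def wrap(metadata, payload):
    """Put a payload back into the {"metadata": ..., "payload": ...} envelope."""
    return json.dumps({"metadata": metadata, "payload": payload}, separators=(",", ":"))


incoming = '{"metadata": "whatever", "payload": {"catKey": 1}}'
metadata, payload = unwrap(incoming)
# Placeholder transformation standing in for the real business logic
payload["catVal"] = f"category-{payload['catKey']}"
outgoing = wrap(metadata, payload)
print(outgoing)
```

Here catKey stays numeric because the payload is parsed as JSON. In the Spark answer, get_json_object always returns a string, which is why catKey appears as "1" in the final output; parsing the payload with from_json and an explicit schema would keep the integer type.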

How to dynamically reference items in a JSON struct using pyspark [duplicate]

I have a pyspark dataframe with StringType column (edges), which contains a list of dictionaries (see example below). The dictionaries contain a mix of value types, including another dictionary (nodeIDs). I need to explode the top-level dictionaries in the edges field into rows; ideally, I should then be able to convert their component values into separate fields.
Input:
import findspark
findspark.init()
from pyspark.sql import Row, SparkSession
SPARK = SparkSession.builder.enableHiveSupport() \
    .getOrCreate()
data = [
Row(trace_uuid='aaaa', timestamp='2019-05-20T10:36:33+02:00', edges='[{"distance":4.382441320292239,"duration":1.5,"speed":2.9,"nodeIDs":{"nodeA":954752475,"nodeB":1665827480}},{"distance":14.48582171131768,"duration":2.6,"speed":5.6,"nodeIDs":{"nodeA":1665827480,"nodeB":3559056131}}]', count=156, level=36),
Row(trace_uuid='bbbb', timestamp='2019-05-20T11:36:10+03:00', edges='[{"distance":0,"duration":0,"speed":0,"nodeIDs":{"nodeA":520686131,"nodeB":520686216}},{"distance":8.654358326561642,"duration":3.1,"speed":2.8,"nodeIDs":{"nodeA":520686216,"nodeB":506361795}}]', count=179, level=258)
]
df = SPARK.createDataFrame(data)
Desired output:
data_reshaped = [
Row(trace_uuid='aaaa', timestamp='2019-05-20T10:36:33+02:00', distance=4.382441320292239, duration=1.5, speed=2.9, nodeA=954752475, nodeB=1665827480, count=156, level=36),
Row(trace_uuid='aaaa', timestamp='2019-05-20T10:36:33+02:00', distance=14.48582171131768, duration=2.6, speed=5.6, nodeA=1665827480, nodeB=3559056131, count=156, level=36),
Row(trace_uuid='bbbb', timestamp='2019-05-20T11:36:10+03:00', distance=0, duration=0, speed=0, nodeA=520686131, nodeB=520686216, count=179, level=258),
Row(trace_uuid='bbbb', timestamp='2019-05-20T11:36:10+03:00', distance=8.654358326561642, duration=3.1, speed=2.8, nodeA=520686216, nodeB=506361795, count=179, level=258)
]
Is there a way to do that? I've tried using cast to cast the edges field into an array first, but I can't figure out how to get it to work with the mixed data types.
I'm using Spark 2.4.0.
You can use from_json() with schema_of_json() to infer the JSON schema. For example:
from pyspark.sql import functions as F
# a sample json string:
edges_json_sample = data[0].edges
# or edges_json_sample = df.select('edges').first()[0]
>>> edges_json_sample
#'[{"distance":4.382441320292239,"duration":1.5,"speed":2.9,"nodeIDs":{"nodeA":954752475,"nodeB":1665827480}},{"distance":14.48582171131768,"duration":2.6,"speed":5.6,"nodeIDs":{"nodeA":1665827480,"nodeB":3559056131}}]'
# infer schema from the sample string
schema = df.select(F.schema_of_json(edges_json_sample)).first()[0]
>>> schema
#u'array<struct<distance:double,duration:double,nodeIDs:struct<nodeA:bigint,nodeB:bigint>,speed:double>>'
# convert json string to data structure and then retrieve desired items
new_df = df.withColumn('data', F.explode(F.from_json('edges', schema))) \
.select('*', 'data.*', 'data.nodeIDs.*') \
.drop('data', 'nodeIDs', 'edges')
>>> new_df.show()
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+
|count|level| timestamp|trace_uuid| distance|duration|speed| nodeA| nodeB|
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+
| 156| 36|2019-05-20T10:36:...| aaaa|4.382441320292239| 1.5| 2.9| 954752475|1665827480|
| 156| 36|2019-05-20T10:36:...| aaaa|14.48582171131768| 2.6| 5.6|1665827480|3559056131|
| 179| 258|2019-05-20T11:36:...| bbbb| 0.0| 0.0| 0.0| 520686131| 520686216|
| 179| 258|2019-05-20T11:36:...| bbbb|8.654358326561642| 3.1| 2.8| 520686216| 506361795|
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+
# expected result
data_reshaped = new_df.rdd.collect()
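To make the reshaping explicit, here is a plain-Python sketch of what explode plus the two star-selects do to a single input row: one output row per edge, with the nested nodeIDs fields lifted to the top level. The function name explode_edges is hypothetical and the row is a plain dict rather than a Spark Row.

```python
import json


def explode_edges(row):
    """Yield one flat dict per element of the JSON array stored in row["edges"]."""
    # Columns other than the JSON string are repeated on every output row
    base = {k: v for k, v in row.items() if k != "edges"}
    for edge in json.loads(row["edges"]):
        # Lift the nested struct's fields (nodeA, nodeB) to the top level
        node_ids = edge.pop("nodeIDs", {})
        yield {**base, **edge, **node_ids}


row = {
    "trace_uuid": "aaaa",
    "timestamp": "2019-05-20T10:36:33+02:00",
    "edges": '[{"distance":4.382441320292239,"duration":1.5,"speed":2.9,'
             '"nodeIDs":{"nodeA":954752475,"nodeB":1665827480}},'
             '{"distance":14.48582171131768,"duration":2.6,"speed":5.6,'
             '"nodeIDs":{"nodeA":1665827480,"nodeB":3559056131}}]',
    "count": 156,
    "level": 36,
}

flat_rows = list(explode_edges(row))
for flat in flat_rows:
    print(flat)
```

Unlike the Spark version, nothing here needs a schema up front, but it also runs on a single machine, so it is only useful for checking the expected shape on a small sample.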

How to convert dataframe output to json format and then Normalize the data?

I am running a SQL query and reading the output into a pandas DataFrame. Now I need to convert the data to JSON and normalize it. I tried to_json, but this gives only a partial solution.
Dataframe output:
| SalesPerson | ContactID |
| 12345       | Tom       |
| 12345       | Robin     |
| 12345       | Julie     |
Expected JSON:
{"SalesPerson": "12345", "ContactID":"Tom","Robin","Julie"}
Please see the code I tried below.
import pandas as pd
q = "SELECT COL1, SalesPerson, ContactID FROM table"
df = pd.read_sql(q, sqlconn)
df1 = df.iloc[:, 1:2]
df2 = df1.to_json(orient='records')
Also, the to_json result is wrapped in brackets, which I don't need.
Try this:
df.groupby('SalesPerson').apply(lambda x: pd.Series({
'ContactID': x['ContactID'].values
})).reset_index().to_json(orient='records')
Output (pretty printed):
[
{
"SalesPerson": 1,
"ContactID": ["Tom", "Robin", "Julie"]
},
{
"SalesPerson": 2,
"ContactID": ["Jack", "Mike", "Mary"]
}
]
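The grouping itself does not depend on pandas. A minimal standard-library sketch, using the three rows from the question and a made-up helper name group_contacts, also shows one way to drop the outer brackets when there is only a single salesperson:

```python
import json
from collections import defaultdict

# The (SalesPerson, ContactID) rows from the question
rows = [("12345", "Tom"), ("12345", "Robin"), ("12345", "Julie")]


def group_contacts(rows):
    """Collect every ContactID under its SalesPerson, keeping first-seen order."""
    grouped = defaultdict(list)
    for sales_person, contact in rows:
        grouped[sales_person].append(contact)
    return [{"SalesPerson": sp, "ContactID": contacts} for sp, contacts in grouped.items()]


records = group_contacts(rows)
# With a single group, emit the object itself instead of a one-element list
output = json.dumps(records[0] if len(records) == 1 else records)
print(output)
```

Unwrapping only makes sense when exactly one group is expected; with several salespeople the list form is the only valid JSON.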

In Apache Spark how could I merge multiple SQL columns from an exploded JSON Array?

I'm reading multiple JSON files from a directory; this JSON has multiple items 'cars' in an array. I'm trying to explode and merge the discrete values from the item 'car' to one dataframe.
A JSON file looks like:
{
"cars": {
"items":
[
{
"latitude": 42.0001,
"longitude": 19.0001,
"name": "Alex"
},
{
"latitude": 42.0002,
"longitude": 19.0002,
"name": "Berta"
},
{
"latitude": 42.0003,
"longitude": 19.0003,
"name": "Chris"
},
{
"latitude": 42.0004,
"longitude": 19.0004,
"name": "Diana"
}
]
}
}
My approaches to explode and merge the values to just one dataframe are:
// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// read.json already returns a DataFrame; toDF() just gives it another name
val jsonDF = jsonData.toDF()
/* Approach 1 */
// User-defined function to 'zip' two columns
val zip = udf((xs: Seq[Double], ys: Seq[Double]) => xs.zip(ys))
jsonDF.withColumn("vars", explode(zip($"cars.items.latitude", $"cars.items.longitude"))).select($"cars.items.name", $"vars._1".alias("varA"), $"vars._2".alias("varB"))
/* Approach 2 */
val df = jsonData.select($"cars.items.name", $"cars.items.latitude", $"cars.items.longitude").toDF("name", "latitude", "longitude")
val df1 = df.select(explode(df("name")).alias("name"), df("latitude"), df("longitude"))
val df2 = df1.select(df1("name").alias("name"), explode(df1("latitude")).alias("latitude"), df1("longitude"))
val df3 = df2.select(df2("name"), df2("latitude"), explode(df2("longitude")).alias("longitude"))
As you can see, the result of Approach 1 is just a dataframe with two of the discrete parameters 'merged', while name is still an array:
+--------------------+---------+---------+
| name| varA| varB|
+--------------------+---------+---------+
|[Leo, Britta, Gor...|48.161079|11.556778|
|[Leo, Britta, Gor...|48.124666|11.617682|
|[Leo, Britta, Gor...|48.352043|11.788091|
|[Leo, Britta, Gor...| 48.25184|11.636337|
The result for Approach 2 is as follows:
+----+---------+---------+
|name| latitude|longitude|
+----+---------+---------+
| Leo|48.161079|11.556778|
| Leo|48.161079|11.617682|
| Leo|48.161079|11.788091|
| Leo|48.161079|11.636337|
| Leo|48.161079|11.560595|
| Leo|48.161079|11.788632|
(The result is a mapping of each 'name' with each 'latitude' with each 'longitude')
The result should be as follows:
+--------------------+---------+---------+
| name| varA| varB|
+--------------------+---------+---------+
|Leo |48.161079|11.556778|
|Britta |48.124666|11.617682|
|Gorch |48.352043|11.788091|
Do you know how to read the files, then split and merge the values so that each line is just one object?
Thank you very much for your help!
To get the expected result, you can try the following approach:
// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// read.json already returns a DataFrame; toDF() is kept only to match the question
val jsonDF = jsonData.toDF()
// Approach: explode the array of structs once, then select the struct fields
val df1 = jsonDF.select(explode(jsonDF("cars.items")).alias("items"))
val df2 = df1.select("items.name", "items.latitude", "items.longitude")
The above approach gives the following result:
+-----+--------+---------+
| name|latitude|longitude|
+-----+--------+---------+
| Alex| 42.0001| 19.0001|
|Berta| 42.0002| 19.0002|
|Chris| 42.0003| 19.0003|
|Diana| 42.0004| 19.0004|
+-----+--------+---------+
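The difference between exploding the array of structs once and exploding three parallel arrays one after another can be shown with plain Python lists. This is an illustrative sketch, not Spark code: zip stands in for keeping each item's fields together, and itertools.product stands in for the chained explodes of Approach 2.

```python
from itertools import product

# Parallel columns, as obtained by selecting cars.items.name, .latitude, .longitude
names = ["Alex", "Berta", "Chris", "Diana"]
latitudes = [42.0001, 42.0002, 42.0003, 42.0004]
longitudes = [19.0001, 19.0002, 19.0003, 19.0004]

# Keeping each item's fields together: one row per car
aligned = list(zip(names, latitudes, longitudes))

# Exploding each array independently: every name with every latitude with every longitude
crossed = list(product(names, latitudes, longitudes))

print(len(aligned))
print(len(crossed))
print(aligned[1])
```

With four items, the independent explodes produce 4 * 4 * 4 rows, which matches the repeated 'Leo' rows seen in the question, while the aligned version produces one row per item.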