Expand JSON from pySpark DataFrame into name / value pairs - json

I have a pySpark dataframe looking like this:
+--+--------------------------------------+
|id|json                                  |
+--+--------------------------------------+
|1 |{"attr1": "value1"}                   |
|2 |{"attr2": "value2", "attr3": "value3"}|
+--+--------------------------------------+
root
|-- id: string (nullable = true)
|-- json: string (nullable = true)
How do I convert it into a new dataframe which will look like this:
+--+-----+------+
|id|attr |value |
+--+-----+------+
|1 |attr1|value1|
|2 |attr2|value2|
|2 |attr3|value3|
+--+-----+------+
(tried to google for the solution with no success, apologies if it's a duplicate)
Thanks!

Please check the schema; it looks to me like a map type. If the json column is of MapType, use map_entries to extract the entries and then explode:
from pyspark.sql.functions import explode, map_entries

df = spark.createDataFrame(Data, schema)
df.withColumn('attri', explode(map_entries('json'))).select('id', 'attri.*').show()
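If the json column is actually a plain string, as the printed schema in the question suggests, a minimal sketch (assuming every JSON object is a flat set of string attributes) is to parse it with from_json into a map and then explode the map into key/value rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the question's layout
df = spark.createDataFrame(
    [("1", '{"attr1": "value1"}'),
     ("2", '{"attr2": "value2", "attr3": "value3"}')],
    ["id", "json"],
)

result = (
    df.withColumn("parsed", from_json("json", MapType(StringType(), StringType())))
      .select("id", explode("parsed").alias("attr", "value"))
)
result.show()   # one row per attribute: id / attr / value, as in the expected output above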

Related

In Spark SQL, transform JSON key name into value

It seems like there should be a function for this in Spark SQL similar to pivoting, but I haven't found any solution to transforming a JSON key into a value. Suppose I have a badly formed JSON (the format of which I cannot change):
{"A long string containing serverA": {"x": 1, "y": 2}}
how can I process it to
{"server": "A", "x": 1, "y": 2}
?
I read the JSONs into an sql.dataframe and would then like to process them as described above:
val cs = spark.read.json("sample.json")
.???
If you want to use only Spark functions and no UDFs, you can use from_json to parse the JSON into a map (you need to specify a schema). Then you just need to extract the information with Spark functions.
One way to do it is as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val schema = MapType(
  StringType,
  StructType(Array(
    StructField("x", IntegerType),
    StructField("y", IntegerType)
  ))
)

spark.read.text("...")
  .withColumn("json", from_json('value, schema))
  .withColumn("key", map_keys('json).getItem(0))
  .withColumn("value", map_values('json).getItem(0))
  .withColumn("server",
    // Extracting the server name with a regex
    regexp_replace(regexp_extract('key, "server[^ ]*", 0), "server", ""))
  .select("server", "value.*")
  .show(false)
which yields:
+------+---+---+
|server|x |y |
+------+---+---+
|A |1 |2 |
+------+---+---+

Extract and explode embedded json fields in apache spark

I'm completely new to spark, but don't mind if the answer is in python or Scala. I can't show the actual data for privacy reasons, but basically I am reading json files with a structure like this:
{
  "EnqueuedTimeUtc": 'some date time',
  "Properties": {},
  "SystemProperties": {
    "connectionDeviceId": "an id",
    "some other fields that we don't care about": "data"
  },
  "Body": {
    "device_id": "an id",
    "tabs": [
      {
        "selected": false,
        "title": "some title",
        "url": "https:...."
      },
      {"same again, for multiple tabs"}
    ]
  }
}
Most of the data is of no interest. What I want is a Dataframe consisting of the time, device_id, and url. There can be multiple urls for the same device and time, so I'm looking to explode these into one row per url.
| timestamp | device_id | url |
My immediate problem is that when I read this, although it can work out the structure of SystemProperties, Body is just a string, probably because of variation. Perhaps I need to specify the schema, would that help?
root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionAuthMethod: string (nullable = true)
| |-- connectionDeviceGenerationId: string (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
| |-- contentEncoding: string (nullable = true)
| |-- contentType: string (nullable = true)
| |-- enqueuedTime: string (nullable = true)
Any idea of an efficient (there are lots and lots of these records) way to extract urls and associate with the time and device_id? Thanks in advance.
Here's an example for extraction. Basically you can use from_json to convert the Body to something that is more structured, and use explode(transform()) to get the URLs and expand to different rows.
# Sample dataframe
df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
|Body |EnqueuedTimeUtc|SystemProperties|
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
|{"device_id":"an id","tabs":[{"selected":false,"title":"some title","url":"https:1"},{"selected":false,"title":"some title","url":"https:2"}]}|some date time |[an id] |
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
df.printSchema()
root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
# Extract desired properties
df2 = df.selectExpr(
    "EnqueuedTimeUtc as timestamp",
    "from_json(Body, 'device_id string, tabs array<map<string,string>>') as Body"
).selectExpr(
    "timestamp",
    "Body.device_id",
    "explode(transform(Body.tabs, x -> x.url)) as url"
)
df2.show()
+--------------+---------+-------+
| timestamp|device_id| url|
+--------------+---------+-------+
|some date time| an id|https:1|
|some date time| an id|https:2|
+--------------+---------+-------+
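On the schema question from the original post: supplying an explicit schema when reading avoids the schema-inference pass over all the files, which matters when there are lots of records. A minimal sketch, assuming the layout shown in the question (the path is a placeholder) and keeping Body as a plain string so it can be parsed with from_json as above:
from pyspark.sql.types import StructType, StructField, StringType

explicit_schema = StructType([
    StructField("Body", StringType(), True),
    StructField("EnqueuedTimeUtc", StringType(), True),
    StructField("SystemProperties", StructType([
        StructField("connectionDeviceId", StringType(), True),
    ]), True),
])

# Placeholder path; point this at the actual location of the JSON files
df = spark.read.schema(explicit_schema).json("/path/to/json/files")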

Merge json files by key using pyspark

The JSON file is in the following format:
Input:
{'key-a' : [{'key1':'value1', 'key2':'value2'},{'key1':'value3', 'key2':'value4'}...],
'key-b':'value-b',
'key-c':'value-c'},
{'key-a' : [{'key1':'value5', 'key2':'value6'},{'key1':'value7', 'key2':'value8'}...],
'key-b':'value-b',
'key-c':'value-c'}
I need to combine the data to merge all the values of 'key-a' and return a single json object as output:
Output:
{'key-a' :
[{'key1':'value1', 'key2':'value2'},
{'key1':'value3', 'key2':'value4'},
{'key1':'value5', 'key2':'value6'},
{'key1':'value7', 'key2':'value8'}...],
'key-b':'value-b',
'key-c':'value-c'}
The data is loaded into a PySpark dataframe with the following schema:
Schema:
key-a
|-- key1: string (nullable= false)
|-- key2: string (nullable= true)
key-b: string (nullable= true)
key-c: string (nullable= false)
I have tried using the groupByKey function, but when I try to show() the output I get the following error: "'GroupedData' object has no attribute 'show'".
How to achieve the above transformation?
This can be a working solution for you:
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Create the dataframe here
df_new = spark.createDataFrame(
    [(str({"key-a": [{"key1": "value1", "key2": "value2"}, {"key1": "value3", "key2": "value4"}], "key-b": "value-b"})),
     (str({"key-a": [{"key1": "value5", "key2": "value6"}, {"key1": "value7", "key2": "value8"}], "key-b": "value-b"}))],
    T.StringType())
df_new.show(truncate=False)
+-----------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------+
|{'key-a': [{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}], 'key-b': 'value-b'}|
|{'key-a': [{'key1': 'value5', 'key2': 'value6'}, {'key1': 'value7', 'key2': 'value8'}], 'key-b': 'value-b'}|
+-----------------------------------------------------------------------------------------------------------+
Use from_json with the correct schema to evaluate the column first. The idea here is to get the keys of the JSON into a column and then use groupBy:
df = df_new.withColumn('col', F.from_json("value",T.MapType(T.StringType(), T.StringType())))
df = df.select("col", F.explode("col").alias("x", "y"))
df.select("x", "y").show(truncate=False)
+-----+---------------------------------------------------------------------+
|x |y |
+-----+---------------------------------------------------------------------+
|key-a|[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]|
|key-b|value-b |
|key-a|[{"key1":"value5","key2":"value6"},{"key1":"value7","key2":"value8"}]|
|key-b|value-b |
+-----+---------------------------------------------------------------------+
Logic here: a dummy column is created so that all rows can be grouped into a single record.
df_grp = df.groupBy("x").agg(F.collect_set("y").alias("y"))
df_grp = df_grp.withColumn("y", F.col("y").cast(T.StringType()))
df_grp = df_grp.withColumn("array", F.array("x", "y"))
df_grp = df_grp.withColumn("dummy_col", F.lit("1"))
df_grp = df_grp.groupBy("dummy_col").agg(F.collect_set("array"))
df_grp.show(truncate=False)
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|dummy_col|collect_set(array) |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[[key-a, [[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}], [{"key1":"value5","key2":"value6"},{"key1":"value7","key2":"value8"}]]], [key-b, [value-b]]]|
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have tried using the groupByKey function, but when I try to show() the output I get the following error: "'GroupedData' object has no attribute 'show'".
This was giving trouble because you did not use any aggregate function in the groupBy clause.
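To make that concrete, a minimal sketch (reusing the df and F from the snippet above): groupBy on its own returns a GroupedData object, which has no show method; only after an aggregation such as agg do you get a DataFrame back.
grouped = df.groupBy("x")                                  # GroupedData, not a DataFrame
# grouped.show()                                           # would raise: 'GroupedData' object has no attribute 'show'
aggregated = grouped.agg(F.collect_set("y").alias("y"))    # aggregation returns a DataFrame
aggregated.show(truncate=False)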

Read data from MongoDB and apply Filter on Multilevel JSON in Scala Spark Connector

I am using the MongoDb Scala connector for Spark. In the documentation
https://docs.mongodb.com/spark-connector/master/scala/aggregation/
it is mentioned how to apply a filter on a given JSON document. What I am not able to figure out is: if we have a multilevel JSON and we want to apply a filter on it, how do we access that key/value in the JSON document?
Json Document:
{ "_id" : 1, "test" : 1 }
{ "_id" : 2, "test" : 2 }
{ "_id" : 3, "test" : 3 }
{ "_id" : 4, "test" : 4 }
{ "_id" : 5, "test" : 5 }
{ "_id" : 6, "test" : 6 }
Filter Document:
val rdd = MongoSpark.load(sc)
val filteredRdd = rdd.filter(doc => doc.getInteger("test") > 5)
println(filteredRdd.count)
println(filteredRdd.first.toJson)
Multilevel Json Document
{
  "_id": 1,
  "test": 1,
  "additionalProperties": {
    "value": "35",
    "phone": "566623232"
  }
}
Problem Statement:
I want to filter on the basis of the "value" attribute, but I don't know how to access it. I tried the following, but it is not working:
val filteredRdd = rdd.filter(doc => doc.getInteger("value") > 5)
val filteredRdd = rdd.filter(doc => doc.getInteger("additionalProperties.value") > 5)
Can anybody guide me on how I can access the "value" attribute? What would be the right syntax?
Some Other Options that I have tried:
According to the official documentation of the MongoDB Scala connector for Spark, I tried filtering the documents with the aggregation pipeline. The following line of code works fine:
val filterWithPipeline = customRdd.withPipeline(Seq(Document.parse("{ $match: { id: { $eq : '134' } } }")))
But if I want to access the "value" item using the same syntax, it doesn't work:
val filterWithPipeline = customRdd.withPipeline(Seq(Document.parse("{ $match: { value: { $eq : '134' } } }")))
So how can I use the same approach to query the multilevel JSON?
Here are a couple of ways you can read from MongoDB and filter it.
Creating SparkSession
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("Test")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/db.collectionName")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/db.CollectionName")
  .getOrCreate
import spark.implicits._
import com.mongodb.spark.sql._
Read as MongoRDD[Document] and filter it
import com.mongodb.spark.MongoSpark
import org.bson.Document

MongoSpark.load(spark.sparkContext).filter(doc => {
  val value = doc.get("additionalProperties").asInstanceOf[Document].get("value")
  value.toString.toInt > 5
})
Read as Dataframe with spark.read.mongo()
val filterDF = spark.read.mongo().filter($"additionalProperties.value".lt(5))
Output:
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|2.0|[5, 566623232] |2.0 |
+---+--------------------+----+
Hope this helps!
What if you use the dataframe?
val df = spark.read.json("path")
Here is my example,
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|1 |[566623232, 35] |1 |
|2 |[566623232, 35] |2 |
|3 |[566623232, 1] |3 |
+---+--------------------+----+
and the schema is
root
|-- _id: long (nullable = true)
|-- additionalProperties: struct (nullable = true)
| |-- phone: string (nullable = true)
| |-- value: string (nullable = true)
|-- test: long (nullable = true)
Then,
df.filter(col("additionalProperties").getItem("value").cast("int") > 5)
will give the result such as:
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|1 |[566623232, 35] |1 |
|2 |[566623232, 35] |2 |
+---+--------------------+----+

Nested dynamic schema not working while parsing JSON using pyspark

I am trying to extract certain parameters from a nested JSON (with a dynamic schema) and generate a Spark dataframe using PySpark.
My code works perfectly for level 1 (key: value) pairs but fails to get independent columns for each (key: value) pair that is part of the nested JSON.
JSON schema sample
Note: this is not the exact schema. It's just to give an idea of the nested nature of the schema.
{
  "tweet": {
    "text": "RT #author original message",
    "user": {
      "screen_name": "Retweeter"
    },
    "retweeted_status": {
      "text": "original message",
      "user": {
        "screen_name": "OriginalTweeter"
      },
      "place": {},
      "entities": {},
      "extended_entities": {}
    }
  },
  "entities": {},
  "extended_entities": {}
}
PySpark Code
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("text", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("retweeted_status", StructType([
        StructField("text", StringType(), True),
        StructField("created_at", StringType(), True)
    ]))
])
df = spark.read.schema(schema).json("/user/sagarp/NaMo/data/NaMo2019-02-12_00H.json")
df.show()
Current output - (with real JSON data)
All (key: value) pairs under the nested retweeted_status JSON are squashed into a single list, e.g. [text, created_at, entities].
+--------------------+--------------------+--------------------+
| text| created_at| retweeted_status|
+--------------------+--------------------+--------------------+
|RT #Hoosier602: #...|Mon Feb 11 19:04:...|[#CLeroyjnr #Gabr...|
|RT #EgSophie: Oh ...|Mon Feb 11 19:04:...|[Oh cool so do yo...|
|RT #JacobAWohl: #...|Mon Feb 11 19:04:...|[#realDonaldTrump...|
Expected output
I want independent columns for each key. Also, note that there is already a parent-level key with the same name, text. How will you deal with such instances?
Ideally, I would want columns like "text", "entities", "retweet_status_text", "retweet_status_entities", etc
Your schema is not mapped properly ... please see these posts if you want to manually construct schema (which is recommended if the data doesn't change):
PySpark: How to Update Nested Columns?
https://docs.databricks.com/_static/notebooks/complex-nested-structured.html
Also, if your JSON is multi-line (like your example) then you can ...
read json via multi-line option to get Spark to infer schema
then save nested schema
then read data back in with the correct schema mapping to avoid triggering a Spark job
! cat nested.json
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
  {
    "string": "string3",
    "int": 3,
    "array": [
      3,
      6,
      9
    ],
    "dict": {
      "key": "value3",
      "extra_key": "extra_value3"
    }
  }
]
getSchema = spark.read.option("multiline", "true").json("nested.json")
extractSchema = getSchema.schema
print(extractSchema)
StructType(List(StructField(array,ArrayType(LongType,true),true),StructField(dict,StructType(List(StructField(extra_key,StringType,true),StructField(key,StringType,true))),true),StructField(int,LongType,true),StructField(string,StringType,true)))
loadJson = spark.read.option("multiline", "true").schema(extractSchema ).json("nested.json")
loadJson.printSchema()
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
loadJson.show(truncate=False)
+---------+----------------------+---+-------+
|array |dict |int|string |
+---------+----------------------+---+-------+
|[1, 2, 3]|[, value1] |1 |string1|
|[2, 4, 6]|[, value2] |2 |string2|
|[3, 6, 9]|[extra_value3, value3]|3 |string3|
+---------+----------------------+---+-------+
Once you have the data loaded with the correct mapping then you can start to transform into a normalized schema via the "dot" notation for nested columns and "explode" to flatten arrays, etc.
loadJson\
.selectExpr("dict.key as key", "dict.extra_key as extra_key").show()
+------+------------+
| key| extra_key|
+------+------------+
|value1| null|
|value2| null|
|value3|extra_value3|
+------+------------+
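For completeness, a small sketch of the "explode" part mentioned above, applied to the same loadJson dataframe: the array column is flattened into one row per element, and nested struct fields can be aliased to prefixed column names (which also avoids clashes like the text vs. retweeted_status.text case from the question):
from pyspark.sql.functions import col, explode

loadJson.select(
    col("string"),
    col("dict.key").alias("dict_key"),            # prefixed alias avoids name clashes
    explode(col("array")).alias("array_element")  # one output row per array element
).show()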