I am trying to extract certain parameters from a nested JSON (with a dynamic schema) and generate a Spark dataframe using PySpark.
My code works perfectly for level 1 (key:value) but fails to get independent columns for each (key:value) pair that is part of the nested JSON.
JSON schema sample
Note - This is not the exact schema. It's just to give an idea of the nested nature of the schema.
{
  "tweet": {
    "text": "RT #author original message",
    "user": {
      "screen_name": "Retweeter"
    },
    "retweeted_status": {
      "text": "original message",
      "user": {
        "screen_name": "OriginalTweeter"
      },
      "place": {
      },
      "entities": {
      },
      "extended_entities": {
      }
    }
  },
  "entities": {
  },
  "extended_entities": {
  }
}
PySpark Code
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("text", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("retweeted_status", StructType([
        StructField("text", StringType(), True),
        StructField("created_at", StringType(), True)
    ]))
])
df = spark.read.schema(schema).json("/user/sagarp/NaMo/data/NaMo2019-02-12_00H.json")
df.show()
Current output - (with real JSON data)
All (key:value) pairs under the nested retweeted_status JSON are squashed into one single column, e.g. [text, created_at, entities].
+--------------------+--------------------+--------------------+
| text| created_at| retweeted_status|
+--------------------+--------------------+--------------------+
|RT #Hoosier602: #...|Mon Feb 11 19:04:...|[#CLeroyjnr #Gabr...|
|RT #EgSophie: Oh ...|Mon Feb 11 19:04:...|[Oh cool so do yo...|
|RT #JacobAWohl: #...|Mon Feb 11 19:04:...|[#realDonaldTrump...|
Expected output
I want independent columns for each key. Also, note that there is already a parent-level key with the same name, text. How would you deal with such instances?
Ideally, I would want columns like "text", "entities", "retweeted_status_text", "retweeted_status_entities", etc.
Your schema is not mapped properly. Please see these posts if you want to manually construct the schema (which is recommended if the data doesn't change):
PySpark: How to Update Nested Columns?
https://docs.databricks.com/_static/notebooks/complex-nested-structured.html
Also, if your JSON is multi-line (like your example) then you can:
read the JSON with the multiline option to let Spark infer the schema,
then save the nested schema,
then read the data back in with that schema mapping, so the read doesn't trigger a schema-inference job.
! cat nested.json
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
{
"string": "string3",
"int": 3,
"array": [
3,
6,
9
],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]
getSchema = spark.read.option("multiline", "true").json("nested.json")
extractSchema = getSchema.schema
print(extractSchema)
StructType(List(StructField(array,ArrayType(LongType,true),true),StructField(dict,StructType(List(StructField(extra_key,StringType,true),StructField(key,StringType,true))),true),StructField(int,LongType,true),StructField(string,StringType,true)))
loadJson = spark.read.option("multiline", "true").schema(extractSchema).json("nested.json")
loadJson.printSchema()
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
loadJson.show(truncate=False)
+---------+----------------------+---+-------+
|array |dict |int|string |
+---------+----------------------+---+-------+
|[1, 2, 3]|[, value1] |1 |string1|
|[2, 4, 6]|[, value2] |2 |string2|
|[3, 6, 9]|[extra_value3, value3]|3 |string3|
+---------+----------------------+---+-------+
Once you have the data loaded with the correct mapping, you can start to transform it into a normalized schema via "dot" notation for nested columns and explode to flatten arrays, etc.
loadJson\
.selectExpr("dict.key as key", "dict.extra_key as extra_key").show()
+------+------------+
| key| extra_key|
+------+------------+
|value1| null|
|value2| null|
|value3|extra_value3|
+------+------------+
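Applied back to your tweet sample, the same pattern would look roughly like this (a sketch, assuming the inferred schema contains the fields from your example; real Twitter payloads have many more fields, and if your file is one JSON object per line you can drop the multiline option). Aliasing the nested fields also resolves the name collision with the top-level text column:

tweets = spark.read.option("multiline", "true").json("/user/sagarp/NaMo/data/NaMo2019-02-12_00H.json")

flat = tweets.selectExpr(
    "text",
    "entities",
    "retweeted_status.text as retweeted_status_text",
    "retweeted_status.entities as retweeted_status_entities"
)
flat.show(truncate=False)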
Related
I have this json data:
consumption_json = """
{
"count": 48,
"next": null,
"previous": null,
"results": [
{
"consumption": 0.063,
"interval_start": "2018-05-19T00:30:00+0100",
"interval_end": "2018-05-19T01:00:00+0100"
},
{
"consumption": 0.071,
"interval_start": "2018-05-19T00:00:00+0100",
"interval_end": "2018-05-19T00:30:00+0100"
},
{
"consumption": 0.073,
"interval_start": "2018-05-18T23:30:00+0100",
"interval_end": "2018-05-18T00:00:00+0100"
}
]
}
"""
and I would like to convert the results list to an Arrow table.
I have managed this by first converting it to a Python data structure using Python's json library, and then converting that to an Arrow table.
import json
import pyarrow as pa

consumption_python = json.loads(consumption_json)
results = consumption_python['results']
table = pa.Table.from_pylist(results)
print(table)
pyarrow.Table
consumption: double
interval_start: string
interval_end: string
----
consumption: [[0.063,0.071,0.073]]
interval_start: [["2018-05-19T00:30:00+0100","2018-05-19T00:00:00+0100","2018-05-18T23:30:00+0100"]]
interval_end: [["2018-05-19T01:00:00+0100","2018-05-19T00:30:00+0100","2018-05-18T00:00:00+0100"]]
But, for reasons of performance, I'd rather just use pyarrow exclusively for this.
I can use pyarrow's json reader to make a table.
import pyarrow.json  # the json submodule is not loaded by "import pyarrow" alone

reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)
And 'results' is a struct nested inside a list. (Actually, everything seems to be nested).
print(table_from_reader['results'].type)
list<item: struct<consumption: double, interval_start: timestamp[s], interval_end: timestamp[s]>>
How do I turn this into a table directly?
Following https://stackoverflow.com/a/72880717/3617057, I can get closer...
import pyarrow.compute as pc
flat = pc.list_flatten(table_from_reader["results"])
print(flat)
[
-- is_valid: all not null
-- child 0 type: double
[
0.063,
0.071,
0.073
]
-- child 1 type: timestamp[s]
[
2018-05-18 23:30:00,
2018-05-18 23:00:00,
2018-05-18 22:30:00
]
-- child 2 type: timestamp[s]
[
2018-05-19 00:00:00,
2018-05-18 23:30:00,
2018-05-17 23:00:00
]
]
flat is a ChunkedArray whose underlying arrays are StructArrays. To convert it to a table, you need to convert each chunk to a RecordBatch and concatenate them into a table:
pa.Table.from_batches(
[
pa.RecordBatch.from_struct_array(s)
for s in flat.iterchunks()
]
)
If flat is just a StructArray (not a ChunkedArray), you can call:
pa.Table.from_batches(
[
pa.RecordBatch.from_struct_array(flat)
]
)
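Putting the pieces together, an end-to-end version might look like this (a sketch, assuming a pyarrow version that provides pc.list_flatten and RecordBatch.from_struct_array):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.json  # needed for pa.json.read_json

# Parse the JSON document; "results" comes back as a list<struct> column
reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)

# Flatten the list column into a ChunkedArray of structs, then turn each
# chunk into a RecordBatch and stitch the batches into a table
flat = pc.list_flatten(table_from_reader["results"])
results_table = pa.Table.from_batches(
    [pa.RecordBatch.from_struct_array(chunk) for chunk in flat.iterchunks()]
)
print(results_table)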
I have a Spark dataframe that looks like this:
I want to flatten the columns.
The result should look like this:
Data:
{
"header": {
"message-id": "ID:EL2-202103221753-77777777-88888-9999999999-1:2:1:1:1",
"reply-to": "queue://CaseProcess.v2",
"timestamp": "2021-03-22T20:07:27"
},
"properties": {
"property": [
{
"name": "ELIS_EXCEPTION_MSG",
"value": "The AWS Access Key Id you provided does not exist in our records"
},
{
"name": "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
"value": "1616458043704"
}
]
}
}
You should first rename the columns in header, then explode the properties.property array, then pivot and group the columns.
Here is an example that produces your wanted result:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import json
if __name__ == "__main__":
    spark = SparkSession.builder.appName("Test").getOrCreate()

    data = {
        "header": {
            "message-id": "ID:EL2-202103221753-77777777-88888-9999999999-1:2:1:1:1",
            "reply-to": "queue://CaseProcess.v2",
            "timestamp": "2021-03-22T20:07:27",
        },
        "properties": {
            "property": [
                {
                    "name": "ELIS_EXCEPTION_MSG",
                    "value": "The AWS Access Key Id you provided does not exist in our records",
                },
                {
                    "name": "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
                    "value": "1616458043704",
                },
            ]
        },
    }

    sc = spark.sparkContext
    df = spark.read.json(sc.parallelize([json.dumps(data)]))

    df = df.select(
        F.col("header.message-id").alias("message-id"),
        F.col("header.reply-to").alias("reply-to"),
        F.col("header.timestamp").alias("timestamp"),
        F.col("properties"),
    )
    df = df.withColumn("propertyexploded", F.explode("properties.property"))
    df = df.withColumn("propertyname", F.col("propertyexploded")["name"])
    df = df.withColumn("propertyvalue", F.col("propertyexploded")["value"])
    df = (
        df.groupBy("message-id", "reply-to", "timestamp")
        .pivot("propertyname")
        .agg(F.first("propertyvalue"))
    )

    df.printSchema()
    df.show()
Result:
root
|-- message-id: string (nullable = true)
|-- reply-to: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- ELIS_EXCEPTION_MSG: string (nullable = true)
|-- ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS: string (nullable = true)
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
| message-id| reply-to| timestamp| ELIS_EXCEPTION_MSG|ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS|
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
|ID:EL2-2021032217...|queue://CaseProce...|2021-03-22T20:07:27|The AWS Access Ke...| 1616458043704|
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
Thanks Vlad. I tried your option and it was successful.
In your solution, the properties.property['name'] is dynamic and I liked that.
The only drawback is that the number of rows gets multiplied by the number of properties, so when I had 3 rows it created 36 rows in the flattened df. Of course, the pivot brings it back to 3 rows.
The problem I faced is that all the properties get repeated 12 times per row. It would have been fine if the properties were small, but I have 2 columns in the stack which can be up to 2K-20K, and I have some queues which go over 100K rows. Repeating that over and over again seems like overkill for my process.
However, I found another solution, with the drawback that the property names have to be hard-coded, but the repetition of rows is eliminated.
Here is what I ended up using:
XXX = df.rdd.flatMap(lambda x: [( x[1]["destination"].replace("queue://", ""),
x[1]["message-id"].replace("ID:", ""),
x[1]["delivery-mode"],
x[1]["expiration"],
x[1]["priority"],
x[1]["redelivered"],
x[1]["timestamp"],
y[0]["value"],
y[1]["value"],
y[2]["value"],
y[3]["value"],
y[4]["value"],
y[5]["value"],
y[6]["value"],
y[7]["value"],
y[8]["value"],
y[9]["value"],
y[10]["value"],
y[11]["value"],
x[0],
x[3],
x[4],
x[5]
) for y in x[2]
])\
.toDF(["queue",
"message_id",
"delivery_mode",
"expiration",
"priority",
"redelivered",
"timestamp",
"ELIS_EXCEPTION_MSG",
"ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
"ELIS_MESSAGE_RETRY_COUNT",
"ELIS_MESSAGE_ORIG_TIMESTAMP",
"ELIS_MDC_TRACER_ID",
"tracestate",
"ELIS_ROOT_CAUSE_EXCEPTION_MSG",
"traceparent",
"ELIS_MESSAGE_TYPE",
"ELIS_EXCEPTION_CLASS",
"newrelic",
"ELIS_EXCEPTION_TRACE",
"body",
"partition_0",
"partition_1",
"partition_2"
])
print(f"... type(XXX): {type(XXX)} | df.count(): {df.count()} | XXX.count(): {XXX.count()}")
output: ... type(XXX): <class 'pyspark.rdd.PipelinedRDD'> | df.count(): 3 | XXX.count(): 3
My column structure comes from ActiveMQ API extracts, which means the column structure is consistent, and it's OK for my use case to hard-code the column names in the flattened df.
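If the positional y[0], y[1], ... indexing ever becomes fragile, another option that avoids both the row multiplication and the explode/pivot is map_from_entries (a sketch, assuming Spark 2.4+ and that properties.property is an array of structs whose first field is the name and second is the value, as in the sample; df here is the dataframe as read, before any explode):

import pyspark.sql.functions as F

# Build a name -> value map from the property array, then pick values by name
props = df.withColumn("prop_map", F.map_from_entries("properties.property"))
flattened = props.select(
    F.col("header.message-id").alias("message-id"),
    F.col("header.reply-to").alias("reply-to"),
    F.col("header.timestamp").alias("timestamp"),
    F.col("prop_map")["ELIS_EXCEPTION_MSG"].alias("ELIS_EXCEPTION_MSG"),
    F.col("prop_map")["ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS"].alias("ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS"),
)
flattened.show()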
I'm completely new to Spark, but I don't mind if the answer is in Python or Scala. I can't show the actual data for privacy reasons, but basically I am reading JSON files with a structure like this:
{
"EnqueuedTimeUtc": 'some date time',
"Properties": {},
"SystemProperties": {
"connectionDeviceId": "an id",
"some other fields that we don't care about": "data"
},
"Body": {
"device_id": "an id",
"tabs": [
{
"selected": false,
"title": "some title",
"url": "https:...."
},
{"same again, for multiple tabs"}
]
}
}
Most of the data is of no interest. What I want is a DataFrame consisting of the time, device_id, and url. There can be multiple URLs for the same device and time, so I'm looking to explode these into one row per URL.
| timestamp | device_id | url |
My immediate problem is that when I read this, although it can work out the structure of SystemProperties, Body is just a string, probably because of variation. Perhaps I need to specify the schema, would that help?
root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionAuthMethod: string (nullable = true)
| |-- connectionDeviceGenerationId: string (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
| |-- contentEncoding: string (nullable = true)
| |-- contentType: string (nullable = true)
| |-- enqueuedTime: string (nullable = true)
Any idea of an efficient (there are lots and lots of these records) way to extract the URLs and associate them with the time and device_id? Thanks in advance.
Here's an example of the extraction. Basically you can use from_json to convert Body into something more structured, and use explode(transform()) to get the URLs and expand them into separate rows.
# Sample dataframe
df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
|Body |EnqueuedTimeUtc|SystemProperties|
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
|{"device_id":"an id","tabs":[{"selected":false,"title":"some title","url":"https:1"},{"selected":false,"title":"some title","url":"https:2"}]}|some date time |[an id] |
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
df.printSchema()
root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
# Extract desired properties
df2 = df.selectExpr(
"EnqueuedTimeUtc as timestamp",
"from_json(Body, 'device_id string, tabs array<map<string,string>>') as Body"
).selectExpr(
"timestamp",
"Body.device_id",
"explode(transform(Body.tabs, x -> x.url)) as url"
)
df2.show()
+--------------+---------+-------+
| timestamp|device_id| url|
+--------------+---------+-------+
|some date time| an id|https:1|
|some date time| an id|https:2|
+--------------+---------+-------+
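If you prefer building the schema explicitly instead of the DDL string inside selectExpr, the same extraction can be sketched with the DataFrame API (the schema below mirrors the sample Body; extend it with whatever real fields you need):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, BooleanType, StringType, StructField, StructType

# Explicit schema for the Body string (only the fields we care about)
body_schema = StructType([
    StructField("device_id", StringType(), True),
    StructField("tabs", ArrayType(StructType([
        StructField("selected", BooleanType(), True),
        StructField("title", StringType(), True),
        StructField("url", StringType(), True),
    ])), True),
])

df2 = (
    df.withColumn("Body", F.from_json("Body", body_schema))
      .select(
          F.col("EnqueuedTimeUtc").alias("timestamp"),
          F.col("Body.device_id").alias("device_id"),
          F.explode("Body.tabs").alias("tab"),
      )
      .select("timestamp", "device_id", F.col("tab.url").alias("url"))
)
df2.show()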
I am using the MongoDb Scala connector for Spark. In the documentation
https://docs.mongodb.com/spark-connector/master/scala/aggregation/
it is mentioned how to apply a filter on a given JSON document. What I am not able to figure out is: if we have a multilevel JSON and we want to apply a filter on it, how do we access that key/value in the JSON document?
Json Document:
{ "_id" : 1, "test" : 1 }
{ "_id" : 2, "test" : 2 }
{ "_id" : 3, "test" : 3 }
{ "_id" : 4, "test" : 4 }
{ "_id" : 5, "test" : 5 }
{ "_id" : 6, "test" : 6 }
Filter Document:
val rdd = MongoSpark.load(sc)
val filteredRdd = rdd.filter(doc => doc.getInteger("test") > 5)
println(filteredRdd.count)
println(filteredRdd.first.toJson)
Multilevel Json Document
{
"_id": 1,
"test": 1,
"additionalProperties": {
"value": "35",
"phone": "566623232"
}
}
Problem Statement:
I want to filter on the basis of the "value" attribute, but I don't know how to access it. I tried the following but it is not working.
val filteredRdd = rdd.filter(doc => doc.getInteger("value") > 5)
val filteredRdd = rdd.filter(doc => doc.getInteger("additionalProperties.value") > 5)
Can anybody guide me on how to access the "value" attribute? What would be the right syntax?
Some Other Options that I have tried:
According to the official documentation of the MongoDB Scala connector for Spark, I tried filtering the documents with the aggregation pipeline. The following line of code works fine:
val filterWithPipeline = customRdd.withPipeline(Seq(Document.parse("{ $match: { id: { $eq : '134' } } }")))
But if I want to access the "value" item using the same syntax, it doesn't work.
val filterWithPipeline = customRdd.withPipeline(Seq(Document.parse("{ $match: { value: { $eq : '134' } } }")))
So how can I use the same approach to query the multilevel JSON?
Here are a couple of ways you can read from MongoDB and filter it.
Creating SparkSession
val spark = SparkSession.builder().master("local").appName("Test")
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/db.collectionName")
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/db.CollectionName")
.getOrCreate
import spark.implicits._
import com.mongodb.spark.sql._
Read as MongoRDD[Document] and filter it
MongoSpark.load(spark.sparkContext).filter(doc => {
val value = doc.get("additionalProperties").asInstanceOf[Document].get("value")
value.toString.toInt > 5
})
Read as a DataFrame with spark.read.mongo()
val filterDF = spark.read.mongo().filter($"additionalProperties.value".lt(5))
Output:
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|2.0|[5, 566623232] |2.0 |
+---+--------------------+----+
Hope this helps!
What if you use the dataframe?
val df = spark.read.json("path")
Here is my example,
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|1 |[566623232, 35] |1 |
|2 |[566623232, 35] |2 |
|3 |[566623232, 1] |3 |
+---+--------------------+----+
and the schema is
root
|-- _id: long (nullable = true)
|-- additionalProperties: struct (nullable = true)
| |-- phone: string (nullable = true)
| |-- value: string (nullable = true)
|-- test: long (nullable = true)
Then,
df.filter(col("additionalProperties").getItem("value").cast("int") > 5)
will give the result such as:
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|1 |[566623232, 35] |1 |
|2 |[566623232, 35] |2 |
+---+--------------------+----+
I'm reading multiple JSON files from a directory; each JSON has multiple 'car' items in an array. I'm trying to explode and merge the discrete values from each 'car' item into one dataframe.
A JSON file looks like:
{
"cars": {
"items":
[
{
"latitude": 42.0001,
"longitude": 19.0001,
"name": "Alex"
},
{
"latitude": 42.0002,
"longitude": 19.0002,
"name": "Berta"
},
{
"latitude": 42.0003,
"longitude": 19.0003,
"name": "Chris"
},
{
"latitude": 42.0004,
"longitude": 19.0004,
"name": "Diana"
}
]
}
}
My approaches to explode and merge the values to just one dataframe are:
// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// To sqlContext to DataFrame
val jsonDF = jsonData.toDF()
/* Approach 1 */
// User-defined function to 'zip' two columns
val zip = udf((xs: Seq[Double], ys: Seq[Double]) => xs.zip(ys))
jsonDF.withColumn("vars", explode(zip($"cars.items.latitude", $"cars.items.longitude"))).select($"cars.items.name", $"vars._1".alias("varA"), $"vars._2".alias("varB"))
/* Approach 2 */
val df = jsonData.select($"cars.items.name", $"cars.items.latitude", $"cars.items.longitude").toDF("name", "latitude", "longitude")
val df1 = df.select(explode(df("name")).alias("name"), df("latitude"), df("longitude"))
val df2 = df1.select(df1("name").alias("name"), explode(df1("latitude")).alias("latitude"), df1("longitude"))
val df3 = df2.select(df2("name"), df2("latitude"), explode(df2("longitude")).alias("longitude"))
As you may see, the result of Approach 1 is just a dataframe of two discrete 'merged' parameters, like:
+--------------------+---------+---------+
| name| varA| varB|
+--------------------+---------+---------+
|[Leo, Britta, Gor...|48.161079|11.556778|
|[Leo, Britta, Gor...|48.124666|11.617682|
|[Leo, Britta, Gor...|48.352043|11.788091|
|[Leo, Britta, Gor...| 48.25184|11.636337|
The result for Approach 2 is as follows:
+----+---------+---------+
|name| latitude|longitude|
+----+---------+---------+
| Leo|48.161079|11.556778|
| Leo|48.161079|11.617682|
| Leo|48.161079|11.788091|
| Leo|48.161079|11.636337|
| Leo|48.161079|11.560595|
| Leo|48.161079|11.788632|
(The result is a mapping of each 'name' with each 'latitude' with each 'longitude')
The result should be as follows:
+--------------------+---------+---------+
| name| varA| varB|
+--------------------+---------+---------+
|Leo |48.161079|11.556778|
|Britta |48.124666|11.617682|
|Gorch |48.352043|11.788091|
Do you know how to read the files, and split and merge the values so that each row is just one object?
Thank you very much for your help!
For getting the expected result you can try the following approach:
// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// To sqlContext to DataFrame
val jsonDF = jsonData.toDF()
// Approach
val df1 = jsonDF.select(explode(jsonDF("cars.items")).alias("items"))
val df2 = df1.select("items.name", "items.latitude", "items.longitude")
The above approach will give you the following result:
+-----+--------+---------+
| name|latitude|longitude|
+-----+--------+---------+
| Alex| 42.0001| 19.0001|
|Berta| 42.0002| 19.0002|
|Chris| 42.0003| 19.0003|
|Diana| 42.0004| 19.0004|
+-----+--------+---------+
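For completeness, the same flattening expressed in PySpark would be roughly (a sketch; the path is a placeholder for wherever your JSON files live):

from pyspark.sql import functions as F

jsonDF = spark.read.json("/mnt/<MountName>/.")  # placeholder path

flattened = (
    jsonDF
    .select(F.explode(F.col("cars.items")).alias("items"))
    .select("items.name", "items.latitude", "items.longitude")
)
flattened.show()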