Pyspark - Parse Nested JSON into Dataframe

I have a json file which looks something like this:
{"customer": 10, "date": "2017.04.06 12:09:32", "itemList": [{"item": "20126907_EA", "price": 1.88, "quantity": 1.0}, {"item": "20185742_EA", "price": 0.99, "quantity": 1.0}, {"item": "20138681_EA", "price": 1.79, "quantity": 1.0}, {"item": "20049778001_EA", "price": 2.47, "quantity": 1.0}, {"item": "20419715007_EA", "price": 3.33, "quantity": 1.0}, {"item": "20321434_EA", "price": 2.47, "quantity": 1.0}, {"item": "20068076_KG", "price": 28.24, "quantity": 10.086}, {"item": "20022893002_EA", "price": 1.77, "quantity": 1.0}, {"item": "20299328003_EA", "price": 1.25, "quantity": 1.0}], "store": "825f9cd5f0390bc77c1fed3c94885c87"}
I read it using this code :
transaction_df = spark \
.read \
.option("multiline", "true") \
.json("../transactions.txt")
Output I have :
+--------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+
|customer|date |itemList |store |
+--------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+
|10 |2017.04.06 12:09:32|[{20126907_EA, 1.88, 1.0}, {20185742_EA, 0.99, 1.0}, {20138681_EA, 1.79, 1.0}, {20049778001_EA, 2.47, 1.0}, {20419715007_EA, 3.33, 1.0}, {20321434_EA, 2.47, 1.0}, {20068076_KG, 28.24, 10.086}, {20022893002_EA, 1.77, 1.0}, {20299328003_EA, 1.25, 1.0}]|825f9cd5f0390bc77c1fed3c94885c87|
+--------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+
Output I am looking for :
+--------+--------------+
|customer| itemList|
+--------+--------------+
| 10| 20126907_EA|
| 10| 20185742_EA|
| 10| 20138681_EA|
| 10|20049778001_EA|
| 10|20419715007_EA|
| 10| 20321434_EA|
| 10| 20068076_KG|
| 10|20022893002_EA|
| 10|20299328003_EA|
+--------+--------------+
Basically I am looking for the customer together with each purchased item (customer id 10 paired with each of its respective purchased items).

You could explode the array and select an item for each row.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.read.json("test.json", multiLine=True)
df = df.withColumn("tmp", F.explode_outer("itemList"))
df = df.select(["customer", "tmp.item"])
df.show(10, False)
df.printSchema()
Output:
+--------+--------------+
|customer|item |
+--------+--------------+
|10 |20126907_EA |
|10 |20185742_EA |
|10 |20138681_EA |
|10 |20049778001_EA|
|10 |20419715007_EA|
|10 |20321434_EA |
|10 |20068076_KG |
|10 |20022893002_EA|
|10 |20299328003_EA|
+--------+--------------+
root
|-- customer: long (nullable = true)
|-- item: string (nullable = true)

I am going to solve this with Spark SQL instead of DataFrames. Both solutions work; however, this one has an important enhancement: it reads the file with an explicit schema.
#
# 1 - create data file
#
# raw data
json_str = """
{"customer": 10, "date": "2017.04.06 12:09:32", "itemList": [{"item": "20126907_EA", "price": 1.88, "quantity": 1.0}, {"item": "20185742_EA", "price": 0.99, "quantity": 1.0}, {"item": "20138681_EA", "price": 1.79, "quantity": 1.0}, {"item": "20049778001_EA", "price": 2.47, "quantity": 1.0}, {"item": "20419715007_EA", "price": 3.33, "quantity": 1.0}, {"item": "20321434_EA", "price": 2.47, "quantity": 1.0}, {"item": "20068076_KG", "price": 28.24, "quantity": 10.086}, {"item": "20022893002_EA", "price": 1.77, "quantity": 1.0}, {"item": "20299328003_EA", "price": 1.25, "quantity": 1.0}], "store": "825f9cd5f0390bc77c1fed3c94885c87"}
"""
# raw file
dbutils.fs.put("/temp/transactions.json", json_str, True)
The above code writes a temporary file.
#
# 2 - read data file w/schema
#
# use library
from pyspark.sql.types import *
# define schema
schema = StructType([
    StructField("customer", StringType(), True),
    StructField("date", StringType(), True),
    StructField("itemList",
        ArrayType(
            StructType([
                StructField("item", StringType(), True),
                StructField("price", DoubleType(), True),
                StructField("quantity", DoubleType(), True)
            ])
        ), True),
    StructField("store", StringType(), True)
])
# read in the data
df1 = spark \
.read \
.schema(schema) \
.option("multiline", "true") \
.json("/temp/transactions.json")
# make temp hive view
df1.createOrReplaceTempView("transaction_data")
This is the most important part of the solution: always supply a schema when reading a text-based file. You do not want the processing code guessing at the format; schema inference can take considerable time if the file is large.
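As a quick illustration of the contrast (a sketch; the variable names are illustrative and the path matches the temp file written above):
# Without an explicit schema, Spark makes an extra pass over the file to infer
# column types, which can be slow for large inputs.
df_inferred = spark.read.option("multiline", "true").json("/temp/transactions.json")
# With the schema defined above, Spark skips inference entirely.
df_typed = spark.read.schema(schema).option("multiline", "true").json("/temp/transactions.json")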
We can use the "%sql" magic command to execute a Spark SQL statement. Use the dot notation to reference an element within the array of structures.
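For example (a sketch using spark.sql(), the programmatic equivalent of a %sql cell; the view and column names follow the code above):
# Dot notation on the array of structs pulls out just the "item" field,
# returning it as an array column next to the customer id.
spark.sql("SELECT customer, itemList.item AS items FROM transaction_data").show(truncate=False)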
If you want to have this data on each row, use the explode command.
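A sketch of the exploded query against the temp view defined above:
# explode() turns every element of the item array into its own row.
exploded_df = spark.sql("SELECT customer, explode(itemList.item) AS item FROM transaction_data")
exploded_df.show(truncate=False)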
Now that we have the final format, we can write the data to disk (optional).
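For instance (assuming the exploded result from the previous step is in exploded_df; the format and output path are only examples):
# Persist the flattened rows; adjust the format and path as needed.
exploded_df.write.mode("overwrite").parquet("/temp/transactions_flat")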

Related

Read complex JSON to extract key values

I have a JSON and I'm trying to read part of it to extract keys and values.
Assuming response is my JSON data, here is my code:
data_dump = json.dumps(response)
data = json.loads(data_dump)
Here my data object becomes a list and I'm trying to get the keys as below
id = [key for key in data.keys()]
This fails with the error:
'list' object has no attribute 'keys'. How can I get past this to produce the output below?
Here is my JSON:
{
"1": {
"task": [
"wakeup",
"getready"
]
},
"2": {
"task": [
"brush",
"shower"
]
},
"3": {
"task": [
"brush",
"shower"
]
},
"activites": ["standup", "play", "sitdown"],
"statuscheck": {
"time": 60,
"color": 1002,
"change(me)": 9898
},
"action": ["1", "2", "3", "4"]
}
The output I need is as below. I do not need data from the rest of JSON.
id | task
---+-----------------
 1 | wakeup, getready
 2 | brush, shower
If you know that the keys you need are "1" and "2", you could try reading the JSON string as a dataframe, unpivoting it, exploding and grouping:
from pyspark.sql import functions as F
df = (spark.read.json(sc.parallelize([data_dump]))
.selectExpr("stack(2, '1', `1`, '2', `2`) (id, task)")
.withColumn('task', F.explode('task.task'))
.groupBy('id').agg(F.collect_list('task').alias('task'))
)
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
However, it may be easier to deal with it in Python:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items() if k in ['1', '2']]
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+

how to extract and modify inner array objects with parent object data in jq

We are trying to format a JSON similar to this:
[
{"id": 1,
"type": "A",
"changes": [
{"id": 12},
{"id": 13}
],
"wanted_key": "good",
"unwanted_key": "aaa"
},
{"id": 2,
"type": "A",
"unwanted_key": "aaa"
},
{"id": 3,
"type": "B",
"changes": [
{"id": 31},
{"id": 32}
],
"unwanted_key": "aaa",
"unwanted_key2": "aaa"
},
{"id": 4,
"type": "B",
"unwanted_key3": "aaa"
},
null,
null,
{"id": 7}
]
into something like this:
[
{
"id": 1,
"type": "A",
"wanted_key": true # every record must have this key/value
},
{
"id": 12, # note: this was in the "changes" property of record id 1
"type": "A", # type should be the same type than record id 1
"wanted_key": true
},
{
"id": 13, # note: this was in the "changes" property of record id 1
"type": "A", # type should be the same type than record id 1
"wanted_key": true
},
{
"id": 2,
"type": "A",
"wanted_key": true
},
{
"id": 3,
"type": "B",
"wanted_key": true
},
{
"id": 31, # note: this was in the "changes" property of record id 3
"type": "B", # type should be the same type than record id 3
"wanted_key": true
},
{
"id": 32, # note: this was in the "changes" property of record id 3
"type": "B", # type should be the same type than record id 3
"wanted_key": true
},
{
"id": 4,
"type": "B",
"wanted_key": true
},
{
"id": 7,
"type": "UNKN", # records without a type should have this type
"wanted_key": true
}
]
So far, I've been able to:
remove null records
obtain the keys we need with their default
give records without a type a default type
What we are missing:
from records having a changes key, create new records with the type of their parent record
join all records in a single array
Unfortunately we are not entirely sure how to proceed... Any help would be appreciated.
So far our jq goes like this:
del(..|nulls) | map({id, type: (.type // "UNKN"), wanted_key: (true)}) | del(..|nulls)
Here's our test code:
https://jqplay.org/s/eLAWwP1ha8P
The following should work:
map(select(values))
| map(., .type as $type | (.changes[]? + {$type}))
| map({id, type: (.type // "UNKN"), wanted_key: true})
Only select non-null values
Return the original items followed by their inner changes array (+ outer type)
Extract 3 properties for output
Multiple map calls can usually be combined, so this becomes:
map(
select(values)
| ., (.type as $type | (.changes[]? + {$type}))
| {id, type: (.type // "UNKN"), wanted_key: true}
)
Another option without variables:
map(
select(values)
| ., .changes[]? + {type}
| {id, type: (.type // "UNKN"), wanted_key: true}
)
# or:
map(select(values))
| map(., .changes[]? + {type})
| map({id, type: (.type // "UNKN"), wanted_key: true})
or even with a separate normalization step for the unknown type:
map(select(values))
| map(.type //= "UNKN")
| map(., .changes[]? + {type})
| map({id, type, wanted_key: true})
# condensed to a single line:
map(select(values) | .type //= "UNKN" | ., .changes[]? + {type} | {id, type, wanted_key: true})
Explanation:
Select only non-null values from the array
If type is not set, create the property with value "UNKN"
Produce the original array items, followed by their nested changes elements extended with the parent type
Reshape objects to only contain properties id, type, and wanted_key.
Here's one way:
map(
select(values)
| (.type // "UNKN") as $type
| ., .changes[]?
| {id, $type, wanted_key: true}
)
[
{
"id": 1,
"type": "A",
"wanted_key": true
},
{
"id": 12,
"type": "A",
"wanted_key": true
},
{
"id": 13,
"type": "A",
"wanted_key": true
},
{
"id": 2,
"type": "A",
"wanted_key": true
},
{
"id": 3,
"type": "B",
"wanted_key": true
},
{
"id": 31,
"type": "B",
"wanted_key": true
},
{
"id": 32,
"type": "B",
"wanted_key": true
},
{
"id": 4,
"type": "B",
"wanted_key": true
},
{
"id": 7,
"type": "UNKN",
"wanted_key": true
}
]
Something like below should work
map(
select(type == "object") |
( {id}, {id : ( .changes[]? .id )} ) +
{ type: (.type // "UNKN"), wanted_key: true }
)

Adding more data in array of object in PostgreSQL

I have a cart table with 2 columns (user_num, data).
user_num will have the phone number of the user and
data will have an array of objects like [{ "id": 1, "quantity": 1 }, { "id": 2, "quantity": 2 }, { "id": 3, "quantity": 3 }], where id is the product id.
user_num | data
----------+--------------------------------------------------------------------------------------
1 | [{ "id": 1, "quantity": 1 }, { "id": 2, "quantity": 2 }, { "id": 3, "quantity": 3 }]
I want to add more product objects to the above array in PostgreSQL.
Thanks!
To add the values, use the JSONB concatenation operator ||, which appends the new array elements:
update test
set data = data || '[{"id": 4, "quantity": 4}, {"id": 5, "quantity": 5}]'
where user_num = 1;

Json Iteration in spark

Input Json file
{
"CarBrands": [{
"model": "audi",
"make": " (YEAR == \"2009\" AND CONDITION in (\"Y\") AND RESALE in (\"2015\")) ",
"service": {
"first": null,
"second": [],
"third": []
},
"dealerspot": [{
"dealername": [
"\"first\"",
"\"abc\""
]
},
{
"dealerlat": [
"\"45.00\"",
"\"38.00\""
]
}
],
"type": "ok",
"plate": true
},
{
"model": "bmw",
"make": " (YEAR == \"2010\" AND CONDITION OR (\"N\") AND RESALE in (\"2016\")) ",
"service": {
"first": null,
"second": [],
"third": []
},
"dealerspot": [{
"dealerlat": [
"\"99.00\"",
"\"38.00\""
]
},
{
"dealername": [
"\"sports\"",
"\"abc\""
]
}
],
"type": "ok",
"plate": true
},
{
"model": "toy",
"make": " (YEAR == \"2013\" AND CONDITION in (\"Y\") AND RESALE in (\"2018\")) ",
"service": {
"first": null,
"second": [],
"third": []
},
"dealerspot": [{
"dealerlat": [
"\"35.00\"",
"\"38.00\""
]
},
{
"dealername": [
"\"nelson\"",
"\"abc\""
]
}
],
"type": "ok",
"plate": true
}
]
}
expected output
+-------+------------+-----------+
| model | dealername | dealerlat |
+-------+------------+-----------+
| audi  | first      | 45        |
| bmw   | sports     | 99        |
| toy   | nelson     | 35        |
+-------+------------+-----------+
import sparkSession.implicits._
val tagsDF = sparkSession.read.option("multiLine", true).option("inferSchema", true).json("src/main/resources/carbrands.json");
val df = tagsDF.select(explode($"CarBrands") as "car_brands")
val dfd = df.withColumn("_tmp", split($"car_brands.make", "\"")).select($"car_brands.model".as("model"),$"car_brands.dealerspot.dealername"(0)(0).as("dealername"),$"car_brands.dealerspot.dealerlat"(0)(0).as("dealerlat"))
Note: since the dealername and dealerlat positions are not fixed, the index (0)(0) doesn't produce the desired output. Please help.
You can convert dealerspot into JSON string and then use JSONPath with get_json_object():
import org.apache.spark.sql.functions.{get_json_object,to_json,trim,explode}
val df1 = (tagsDF.withColumn("car_brands", explode($"CarBrands"))
.select("car_brands.*")
.withColumn("dealerspot", to_json($"dealerspot")))
//+--------------------+--------------------+-----+-----+----------+----+
//| dealerspot| make|model|plate| service|type|
//+--------------------+--------------------+-----+-----+----------+----+
//|[{"dealername":["...| (YEAR == "2009" ...| audi| true|[, [], []]| ok|
//|[{"dealerlat":["\...| (YEAR == "2010" ...| bmw| true|[, [], []]| ok|
//|[{"dealerlat":["\...| (YEAR == "2013" ...| toy| true|[, [], []]| ok|
//+--------------------+--------------------+-----+-----+----------+----+
df1.select(
$"model"
, trim(get_json_object($"dealerspot", "$[*].dealername[0]"), "\"\\") as "dealername"
, trim(get_json_object($"dealerspot", "$[*].dealerlat[0]"), "\"\\") as "dealerlat"
).show
//+-----+----------+---------+
//|model|dealername|dealerlat|
//+-----+----------+---------+
//| audi| first| 45.00|
//| bmw| sports| 99.00|
//| toy| nelson| 35.00|
//+-----+----------+---------+

flatten nested data structure in Spark

I have the following dataframe:
df.show()
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
| address| coordinates| id|latitude|longitude| name|position| json|
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
|Balfour St / Brun...|[-27.463431, 153....|79.0| null| null|79 - BALFOUR ST /...| null|[-27.463431, 153.041031]|
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
I want to flatten the json column.
I did :
val jsonSchema = StructType(Seq(
StructField("latitude", DoubleType, nullable = true),
StructField("longitude", DoubleType, nullable = true)))
val a = df.select(from_json(col("json"), jsonSchema) as "content")
but
a.show() gives me :
+-------+
|content|
+-------+
| null|
+-------+
Any idea how to parse json col properly and get content col in second dataframe (a) not null ?
Raw data is presented as :
{
"id": 79,
"name": "79 - BALFOUR ST / BRUNSWICK ST",
"address": "Balfour St / Brunswick St",
"coordinates": {
"latitude": -27.463431,
"longitude": 153.041031
}
}
Thanks a lot
The problem is your schema. You are trying to access nested collection values as if they were regular values. I made changes to your schema and it worked for me.
val df = spark.createDataset(
"""
|{
| "id": 79,
| "name": "79 - BALFOUR ST / BRUNSWICK ST",
| "address": "Balfour St / Brunswick St",
| "coordinates": {
| "latitude": -27.463431,
| "longitude": 153.041031
| }
| }
""".stripMargin :: Nil)
val jsonSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("coordinates",
    StructType(Seq(
      StructField("latitude", DoubleType, true),
      StructField("longitude", DoubleType, true)
    )), true)
))
val a = df.select(from_json(col("value"), jsonSchema) as "content")
a.show(false)
Output
+--------------------------------------------------------+
|content |
+--------------------------------------------------------+
|[79 - BALFOUR ST / BRUNSWICK ST,[-27.463431,153.041031]]|
+--------------------------------------------------------+