Exploding struct column values in pyspark - json

I have a json file which looks like this
{
  "tags": [
    "Real_send",
    "stopped"
  ],
  "messages": {
    "7c2e9284-993d-4eb4-ad6b-6a2bfcc51060": {
      "channel": "channel 1",
      "name": "Version 1",
      "alert": "\ud83d\ude84 alert 1"
    },
    "c2cbd05c-5452-476c-bdc7-ac31ed3417f9": {
      "channel": "channel 1",
      "name": "name 1",
      "type": "type 1"
    },
    "b869886f-0f9c-487f-8a43-abe3d6456678": {
      "channel": "channel 2",
      "name": "Version 2",
      "alert": "\ud83d\ude84 alert 2"
    }
  }
}
I want the output to be a table with the message key as an id column alongside the alert, channel, type and name fields.
When I print the schema I get the following from Spark:
StructType(List(
StructField(messages,
StructType(List(
StructField(7c2e9284-993d-4eb4-ad6b-6a2bfcc51060,
StructType(List(
StructField(alert,StringType,true),
StructField(channel,StringType,true),
StructField(name,StringType,true))),true),
StructField(b869886f-0f9c-487f-8a43-abe3d6456678,StructType(List(
StructField(alert,StringType,true),
StructField(channel,StringType,true),
StructField(name,StringType,true))),true),
StructField(c2cbd05c-5452-476c-bdc7-ac31ed3417f9,StructType(List(
StructField(channel,StringType,true),
StructField(name,StringType,true),
StructField(type,StringType,true))),true))),true),
StructField(tags,ArrayType(StringType,true),true)))
Basically 7c2e9284-993d-4eb4-ad6b-6a2bfcc51060 should be considered as my ID column
My code looks like:
cols_list_to_select_from_flattened = ['alert', 'channel', 'type', 'name']
df = df \
    .select(
        F.json_tuple(
            F.col('messages'), *cols_list_to_select_from_flattened
        )
        .alias(*cols_list_to_select_from_flattened))
df.show(1, False)
Error message:
E pyspark.sql.utils.AnalysisException: cannot resolve 'json_tuple(`messages`, 'alert', 'channel', 'type', 'name')' due to data type mismatch: json_tuple requires that all arguments are strings;
E 'Project [json_tuple(messages#0, alert, channel, type, name) AS ArrayBuffer(alert, channel, type, name)]
E +- Relation[messages#0,tags#1] json
I also tried to list all keys like below
df.withColumn("map_json_column", F.posexplode_outer(F.col("messages"))).show()
But I got this error:
E pyspark.sql.utils.AnalysisException: cannot resolve 'posexplode(`messages`)' due to data type mismatch: input to function explode should be array or map type, not struct<7c2e9284-993d-4eb4-ad6b-6a2bfcc51060:struct<alert:string,channel:string,name:string>,b869886f-0f9c-487f-8a43-abe3d6456678:struct<alert:string,channel:string,name:string>,c2cbd05c-5452-476c-bdc7-ac31ed3417f9:struct<channel:string,name:string,type:string>>;
E 'Project [messages#0, tags#1, generatorouter(posexplode(messages#0)) AS map_json_column#5]
E +- Relation[messages#0,tags#1] json
How can I get the desired output?

When reading the JSON you can specify your own schema: instead of the messages column being a struct type, make it a map type, and then you can simply explode that column.
Here is a self-contained example with your data:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

json_sample = """
{
  "tags": ["Real_send", "stopped"],
  "messages": {
    "7c2e9284-993d-4eb4-ad6b-6a2bfcc51060": {
      "channel": "channel 1",
      "name": "Version 1",
      "alert": "alert 1"
    },
    "c2cbd05c-5452-476c-bdc7-ac31ed3417f9": {
      "channel": "channel 1",
      "name": "name 1",
      "type": "type 1"
    },
    "b869886f-0f9c-487f-8a43-abe3d6456678": {
      "channel": "channel 2",
      "name": "Version 2",
      "alert": "alert 2"
    }
  }
}
"""
data = spark.sparkContext.parallelize([json_sample])

cols_to_select = ['alert', 'channel', 'type', 'name']

# The schema of a single message entry; only the columns that need to be
# selected are parsed, and they must be nullable given your data sample.
message_schema = StructType([
    StructField(col_name, StringType(), True) for col_name in cols_to_select
])

# The complete document schema: messages becomes a map keyed by the message id.
json_schema = StructType([
    StructField("tags", ArrayType(StringType(), True), True),
    StructField("messages", MapType(StringType(), message_schema, False), False),
])

# Read the JSON and parse it against the specified schema.
# Instead of the sample data you can pass a file path here.
df = spark.read.schema(json_schema).json(data)

# Explode the map column and select the required columns.
df = (
    df
    .select(F.explode(F.col("messages")))
    .select(
        F.col("key").alias("id"),
        *[F.col(f"value.{col_name}").alias(col_name) for col_name in cols_to_select]
    )
)
df.show(truncate=False)
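If you would rather not spell out the whole document schema by hand, a rough alternative (just a sketch, assuming Spark 2.2+ so from_json accepts a MapType) is to read with the inferred schema, serialize the messages struct back to JSON with to_json, and re-parse it as a map before exploding. It reuses message_schema and cols_to_select from the example above:
raw_df = spark.read.json(data)  # messages is inferred as a struct here

# Re-parse the struct as map<string, struct> so it can be exploded.
messages_as_map = F.from_json(F.to_json("messages"), MapType(StringType(), message_schema))

df2 = (
    raw_df
    .select(F.explode(messages_as_map))
    .select(
        F.col("key").alias("id"),
        *[F.col(f"value.{c}").alias(c) for c in cols_to_select]
    )
)
df2.show(truncate=False)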

Related

Pulling specific Parent/Child JSON data with Python

I'm having a difficult time figuring out how to pull specific information from a json file.
So far I have this:
# Import json library
import json

# Open json database file
with open('jsondatabase.json', 'r') as f:
    data = json.load(f)

# assign variables from json data and convert to usable information
identifier = data['ID']
identifier = str(identifier)
name = data['name']
name = str(name)

# Collect data from user to compare with data in json file
print("Please enter your numerical identifier and name: ")
user_id = input("Numerical identifier: ")
user_name = input("Name: ")

if user_id == identifier and user_name == name:
    print("Your inputs matched. Congrats.")
else:
    print("Your inputs did not match our data. Please try again.")
And that works great for a simple JSON file like this:
{
"ID": "123",
"name": "Bobby"
}
But ideally I need to create a more complex JSON file and can't find deeper information on how to pull specific information from something like this:
{
  "Parent": [
    {
      "Parent_1": [
        {
          "Name": "Bobby",
          "ID": "123"
        }
      ],
      "Parent_2": [
        {
          "Name": "Linda",
          "ID": "321"
        }
      ]
    }
  ]
}
Here is an example that you might be able to pick apart.
You could either:
Make a custom de-jsonify object_hook as shown below and do something with it. There is a good tutorial here.
Just gobble up the whole dictionary you get without a custom de-jsonify and drill down into it to build a list or set of the results (a rough sketch of this option is shown after the example below).
Example:
import json
from collections import namedtuple

data = '''
{
  "Parents": [
    {
      "Name": "Bobby",
      "ID": "123"
    },
    {
      "Name": "Linda",
      "ID": "321"
    }
  ]
}
'''

Parent = namedtuple('Parent', ['name', 'id'])

def dejsonify(json_str: dict):
    if json_str.get("Name"):
        parent = Parent(json_str.get('Name'), int(json_str.get('ID')))
        return parent
    return json_str

res = json.loads(data, object_hook=dejsonify)
print(res)

# then we can do whatever... if you need lookups by name/id,
# we could put the result into a dictionary
all_parents = {(p.name, p.id): p for p in res['Parents']}

lookup_from_input = ('Bobby', 123)
print(f'found match: {all_parents.get(lookup_from_input)}')
Result:
{'Parents': [Parent(name='Bobby', id=123), Parent(name='Linda', id=321)]}
found match: Parent(name='Bobby', id=123)
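For completeness, here is a rough sketch of the second option against the nested JSON from your question: load everything into plain dicts/lists and drill down manually (the file name is the same hypothetical one from your code).
import json

with open('jsondatabase.json', 'r') as f:
    data = json.load(f)

# data["Parent"] is a list holding one dict with keys like "Parent_1", "Parent_2", ...
people = {}
for parent_group in data["Parent"]:
    for parent_key, entries in parent_group.items():
        for entry in entries:  # each entry is {"Name": ..., "ID": ...}
            people[(entry["Name"], entry["ID"])] = parent_key

# IDs are strings in your JSON, so look them up as strings.
print(people.get(("Bobby", "123")))  # -> "Parent_1"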

pyspark explode json array of dictionary items with key/values pairs into columns

I have a spark dataframe that looks like this:
I want to flatten the columns.
Result should look like this:
Data:
{
  "header": {
    "message-id": "ID:EL2-202103221753-77777777-88888-9999999999-1:2:1:1:1",
    "reply-to": "queue://CaseProcess.v2",
    "timestamp": "2021-03-22T20:07:27"
  },
  "properties": {
    "property": [
      {
        "name": "ELIS_EXCEPTION_MSG",
        "value": "The AWS Access Key Id you provided does not exist in our records"
      },
      {
        "name": "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
        "value": "1616458043704"
      }
    ]
  }
}
You should first flatten and rename the columns from header, then explode the properties.property array, and finally pivot and group the columns.
Here is an example that produces your wanted result:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import json
if __name__ == "__main__":
    spark = SparkSession.builder.appName("Test").getOrCreate()
    data = {
        "header": {
            "message-id": "ID:EL2-202103221753-77777777-88888-9999999999-1:2:1:1:1",
            "reply-to": "queue://CaseProcess.v2",
            "timestamp": "2021-03-22T20:07:27",
        },
        "properties": {
            "property": [
                {
                    "name": "ELIS_EXCEPTION_MSG",
                    "value": "The AWS Access Key Id you provided does not exist in our records",
                },
                {
                    "name": "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
                    "value": "1616458043704",
                },
            ]
        },
    }

    sc = spark.sparkContext
    df = spark.read.json(sc.parallelize([json.dumps(data)]))

    df = df.select(
        F.col("header.message-id").alias("message-id"),
        F.col("header.reply-to").alias("reply-to"),
        F.col("header.timestamp").alias("timestamp"),
        F.col("properties"),
    )
    df = df.withColumn("propertyexploded", F.explode("properties.property"))
    df = df.withColumn("propertyname", F.col("propertyexploded")["name"])
    df = df.withColumn("propertyvalue", F.col("propertyexploded")["value"])

    df = (
        df.groupBy("message-id", "reply-to", "timestamp")
        .pivot("propertyname")
        .agg(F.first("propertyvalue"))
    )

    df.printSchema()
    df.show()
Result:
root
|-- message-id: string (nullable = true)
|-- reply-to: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- ELIS_EXCEPTION_MSG: string (nullable = true)
|-- ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS: string (nullable = true)
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
| message-id| reply-to| timestamp| ELIS_EXCEPTION_MSG|ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS|
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
|ID:EL2-2021032217...|queue://CaseProce...|2021-03-22T20:07:27|The AWS Access Ke...| 1616458043704|
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
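One note on the pivot step: if you already know the property names, you can pass them to pivot explicitly, which lets Spark skip the extra job it otherwise runs to collect the distinct pivot values. A small sketch, reusing the columns from the example above (the list of names is illustrative):
known_properties = ["ELIS_EXCEPTION_MSG", "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS"]
df = (
    df.groupBy("message-id", "reply-to", "timestamp")
    .pivot("propertyname", known_properties)
    .agg(F.first("propertyvalue"))
)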
Thanks Vlad. I tried your option and it was successful.
In your solution, the handling of properties.property['name'] is dynamic, and I liked that.
The only drawback is that the number of rows gets multiplied by the number of properties: when I had 3 rows, it created 36 rows in the flattened df, which the pivot then brings back down to 3.
The problem I faced is that all the properties get repeated 12 times per row. That would have been fine if the properties were small, but I have 2 columns in the stack that can be 2K-20K characters, and some queues go over 100K rows. Repeating that over and over seems like overkill for my process.
However, I found another solution, with the drawback that the property names have to be hard-coded, but the repetition of rows is eliminated.
Here is what I ended up using:
XXX = df.rdd.flatMap(lambda x: [( x[1]["destination"].replace("queue://", ""),
x[1]["message-id"].replace("ID:", ""),
x[1]["delivery-mode"],
x[1]["expiration"],
x[1]["priority"],
x[1]["redelivered"],
x[1]["timestamp"],
y[0]["value"],
y[1]["value"],
y[2]["value"],
y[3]["value"],
y[4]["value"],
y[5]["value"],
y[6]["value"],
y[7]["value"],
y[8]["value"],
y[9]["value"],
y[10]["value"],
y[11]["value"],
x[0],
x[3],
x[4],
x[5]
) for y in x[2]
])\
.toDF(["queue",
"message_id",
"delivery_mode",
"expiration",
"priority",
"redelivered",
"timestamp",
"ELIS_EXCEPTION_MSG",
"ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
"ELIS_MESSAGE_RETRY_COUNT",
"ELIS_MESSAGE_ORIG_TIMESTAMP",
"ELIS_MDC_TRACER_ID",
"tracestate",
"ELIS_ROOT_CAUSE_EXCEPTION_MSG",
"traceparent",
"ELIS_MESSAGE_TYPE",
"ELIS_EXCEPTION_CLASS",
"newrelic",
"ELIS_EXCEPTION_TRACE",
"body",
"partition_0",
"partition_1",
"partition_2"
])
print(f"... type(XXX): {type(XXX)} | df.count(): {df.count()} | XXX.count(): {XXX.count()}")
output: ... type(XXX): <class 'pyspark.rdd.PipelinedRDD'> | df.count(): 3 | XXX.count(): 3
My column structure comes from ActiveMQ API extracts, which means the column structure is consistent and it's OK for my use case to hard-code the column names in the flattened_df.
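Another middle ground, in case it helps: on Spark 2.4+ you can turn the (name, value) property array into a single map column with map_from_entries and pull out just the keys you need, which avoids both the row multiplication from explode/pivot and the positional y[0]...y[11] indexing. A sketch against the JSON from the answer above (it assumes each property struct keeps the field order name, value, and that raw_df is the dataframe right after spark.read.json):
import pyspark.sql.functions as F

# Collapse the array of {name, value} structs into one map<string,string> column.
props = F.map_from_entries("properties.property")

flattened = raw_df.select(
    F.col("header.message-id").alias("message-id"),
    F.col("header.reply-to").alias("reply-to"),
    F.col("header.timestamp").alias("timestamp"),
    props["ELIS_EXCEPTION_MSG"].alias("ELIS_EXCEPTION_MSG"),
    props["ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS"].alias("ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS"),
)
flattened.show(truncate=False)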

how to convert some JSON attributes into rows using dataframes in spark

I am a newbie trying to resolve the following problem. Any help is highly appreciated.
I have the following Json.
{
  "index": "identity",
  "type": "identity",
  "id": "100000",
  "source": {
    "link_data": {
      "source_Id": "0011245"
    },
    "attribute_data": {
      "first": {
        "val": [
          true
        ],
        "updated_at": "2011"
      },
      "second": {
        "val": [
          true
        ],
        "updated_at": "2010"
      }
    }
  }
}
Attributes under "attribute_data" may vary. it can have another attribute, say "third"
I am expecting the result in below format:
_index _type _id source_Id attribute_data val updated_at
ID ID randomid 00000 first true 2000-08-08T07:51:14Z
ID ID randomid 00000 second true 2010-08-08T07:51:14Z
I tried the following approach.
val df = spark.read.json("sample.json")
val res = df.select("index","id","type","source.attribute_data.first.updated_at", "source.attribute_data.first.val", "source.link_data.source_id");
It just adds new columns, not rows, as follows:
index id type updated_at val source_id
identity 100000 identity 2011 [true] 0011245
Try the following:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = spark.read.json("sample.json")
df.select($"id", $"index", $"source.link_data.source_Id".as("source_Id"),$"source.attribute_data.first.val".as("first"), explode($"source.attribute_data.second.val").as("second"), $"type")
.select($"id", $"index", $"source_Id", $"second", explode($"first"), $"type").show
Here you go with the solution. Feel free to ask if you need to understand anything:
val data = spark.read.json("sample.json")
val readJsonDf = data.select($"index", $"type", $"id", $"source.link_data.source_id".as("source_id"), $"source.attribute_data.*")
readJsonDf.show()
Initial Output:
+--------+--------+------+---------+--------------------+--------------------+
| index| type| id|source_id| first| second|
+--------+--------+------+---------+--------------------+--------------------+
|identity|identity|100000| 0011245|[2011,WrappedArra...|[2010,WrappedArra...|
+--------+--------+------+---------+--------------------+--------------------+
Then I did the dynamic transformation using the following lines of code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def transposeColumnstoRows(df: DataFrame, constantCols: Seq[String]): DataFrame = {
  val (cols, types) = df.dtypes.filter { case (c, _) => !constantCols.contains(c) }.unzip

  // check that the columns to be transposed to rows all have the same structure
  require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")

  val keyColsWIthValues = explode(array(
    cols.map(c => struct(lit(c).alias("columnKey"), col(c).alias("value"))): _*
  ))

  df.select(constantCols.map(col(_)) :+ keyColsWIthValues.alias("keyColsWIthValues"): _*)
}
val newDf = transposeColumnstoRows(readJsonDf, Seq("index","type","id","source_id"))
val requiredDf = newDf.select($"index",$"type",$"id",$"source_id",$"keyColsWIthValues.columnKey".as("attribute_data"),$"keyColsWIthValues.value.updated_at".as("updated_at"),$"keyColsWIthValues.value.val".as("val"))
requiredDf.show()
Final Output:
| index| type| id|source_id|attribute_data|updated_at| val|
+--------+--------+------+---------+--------------+----------+------+
|identity|identity|100000| 0011245| first| 2011|[true]|
|identity|identity|100000| 0011245| second| 2010|[true]|
Hope this solves your issue!
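Since the rest of this thread is PySpark, here is roughly the same columns-to-rows idea written in PySpark. This is only a minimal sketch: it assumes an existing SparkSession named spark, the sample.json from the question, and that the first/second structs share the same shape (which array() requires):
import pyspark.sql.functions as F

attribute_cols = ["first", "second"]  # would vary with your data

# Build one (attribute_data, value) struct per attribute and explode them into rows.
key_value = F.explode(F.array(*[
    F.struct(
        F.lit(c).alias("attribute_data"),
        F.col(f"source.attribute_data.{c}").alias("value"),
    )
    for c in attribute_cols
]))

result = (
    spark.read.json("sample.json")
    .select(
        "index", "type", "id",
        F.col("source.link_data.source_Id").alias("source_Id"),
        key_value.alias("kv"),
    )
    .select(
        "index", "type", "id", "source_Id",
        F.col("kv.attribute_data").alias("attribute_data"),
        F.col("kv.value.updated_at").alias("updated_at"),
        F.col("kv.value.val").alias("val"),
    )
)
result.show(truncate=False)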

How to lookup data from 2 collections in MongoDB using python

I need to read data from 2 collections in MongoDB with Python. Is there any way to join the data in Python?
Let's say that we have two collections(tables):
buy_orders
sell_orders
Those tables have the same field 'id_transaction' , and we want to join those tables on this field:
import pymongo
my_client = pymongo.MongoClient('mongodb://localhost:27017/')
my_db = my_client['Orders']
my_collection = my_db['buy_orders']
result = my_collection.aggregate([{
'$lookup' : {'from': 'sell_orders','localField': 'id_transaction','foreignField': 'id_transaction','as': 'results' }
}])
To print results:
for item in result:
print(item)
For more references: MongoDB Docs and PyMongo Docs
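If you prefer to pull both collections into Python and join them there instead of using $lookup, a minimal sketch using the same collection and field names as above:
import pymongo
from collections import defaultdict

my_client = pymongo.MongoClient('mongodb://localhost:27017/')
my_db = my_client['Orders']

# Index sell orders by the join key, then attach the matches to each buy order.
sells_by_tx = defaultdict(list)
for sell in my_db['sell_orders'].find():
    sells_by_tx[sell['id_transaction']].append(sell)

joined = []
for buy in my_db['buy_orders'].find():
    joined.append({**buy, 'results': sells_by_tx.get(buy['id_transaction'], [])})

for item in joined:
    print(item)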
Have a look here
from bson.objectid import ObjectId

# db is a pymongo database handle, e.g.:
# import pymongo
# db = pymongo.MongoClient('mongodb://localhost:27017/')['your_database']

# the custom_id for reference
custom_id = ObjectId()

# creating a user with the role admin
db.users.insert_one({"name": "Boston", "role_id": custom_id})

# creating the role with the custom id
db.roles.insert_one({"_id": custom_id, "name": "Admin"})

# lookup usage
db.users.aggregate([
    {
        "$lookup": {
            "from": "roles",
            "localField": "role_id",
            "foreignField": "_id",
            "as": "roles"
        }
    }
])

Postgres JSON data type Rails query

I am using Postgres' json data type but want to do a query/ordering with data that is nested within the json.
I want to order or query with .where on the json data type. For example, I want to query for users that have a follower count > 500 or I want to order by follower or following count.
Thanks!
Example:
model User
data: {
"photos"=>[
{"type"=>"facebook", "type_id"=>"facebook", "type_name"=>"Facebook", "url"=>"facebook.com"}
],
"social_profiles"=>[
{"type"=>"vimeo", "type_id"=>"vimeo", "type_name"=>"Vimeo", "url"=>"http://vimeo.com/", "username"=>"v", "id"=>"1"},
{"bio"=>"I am not a person, but a series of plants", "followers"=>1500, "following"=>240, "type"=>"twitter", "type_id"=>"twitter", "type_name"=>"Twitter", "url"=>"http://www.twitter.com/", "username"=>"123", "id"=>"123"}
]
}
For anyone who stumbles upon this: I have come up with a list of queries using ActiveRecord and Postgres' JSON data type. Feel free to edit this to make it clearer.
Documentation to the JSON operators used below: https://www.postgresql.org/docs/current/functions-json.html.
# Sort based on the JSON data:
Post.order("data->'hello' DESC")
=> #<ActiveRecord::Relation [
#<Post id: 4, data: {"hi"=>"23", "hello"=>"22"}>,
#<Post id: 3, data: {"hi"=>"13", "hello"=>"21"}>,
#<Post id: 2, data: {"hi"=>"3", "hello"=>"2"}>,
#<Post id: 1, data: {"hi"=>"2", "hello"=>"1"}>]>
# Where inside a JSON object:
Record.where("data ->> 'likelihood' = '0.89'")
# Example json object:
r.column_data
=> {"data1"=>[1, 2, 3],
"data2"=>"data2-3",
"array"=>[{"hello"=>1}, {"hi"=>2}],
"nest"=>{"nest1"=>"yes"}}
# Nested search:
Record.where("column_data -> 'nest' ->> 'nest1' = 'yes' ")
# Search within array:
Record.where("column_data #>> '{data1,1}' = '2' ")
# Search within a value that's an array:
Record.where("column_data #> '{array,0}' ->> 'hello' = '1' ")
# this only finds matches in one element of the array.
# All elements:
Record.where("column_data ->> 'array' LIKE '%hello%' ") # bad
Record.where("column_data ->> 'array' LIKE ?", "%hello%") # good
According to this http://edgeguides.rubyonrails.org/active_record_postgresql.html#json
there's a difference in using -> and ->>:
# db/migrate/20131220144913_create_events.rb
create_table :events do |t|
t.json 'payload'
end
# app/models/event.rb
class Event < ActiveRecord::Base
end
# Usage
Event.create(payload: { kind: "user_renamed", change: ["jack", "john"]})
event = Event.first
event.payload # => {"kind"=>"user_renamed", "change"=>["jack", "john"]}
## Query based on JSON document
# The -> operator returns the original JSON type (which might be an object), whereas ->> returns text
Event.where("payload->>'kind' = ?", "user_renamed")
So you should try Record.where("data ->> 'status' = '200'") (note that ->> returns text, so compare against a string or cast it) or the operator that suits your query (http://www.postgresql.org/docs/current/static/functions-json.html).
Your question doesn't seem to correspond to the data you've shown, but if your table is named users and data is a JSON field in that table containing something like {"count": 123}, then the query
SELECT * FROM users WHERE (data->>'count')::int > 500;
will work. Take a look at your database schema to make sure you understand the layout, and check that the query works before complicating it with Rails conventions.
JSON filtering in Rails (the containment operator @> below requires a jsonb column):
Event.create(payload: [{ "name": 'Jack', "age": 12 },
                       { "name": 'John', "age": 13 },
                       { "name": 'Dohn', "age": 24 }])
Event.where('payload @> ?', '[{"age": 12}]')
# You can also filter by the name key
Event.where('payload @> ?', '[{"name": "John"}]')
# You can also filter by {"name": "Jack", "age": 12}
Event.where('payload @> ?', [{ "name": 'Jack', "age": 12 }].to_json)
You can find more about this here