I am a newbie trying to solve the following problem; any help is highly appreciated.
I have the following JSON:
{
  "index": "identity",
  "type": "identity",
  "id": "100000",
  "source": {
    "link_data": {
      "source_Id": "0011245"
    },
    "attribute_data": {
      "first": {
        "val": [
          true
        ],
        "updated_at": "2011"
      },
      "second": {
        "val": [
          true
        ],
        "updated_at": "2010"
      }
    }
  }
}
The attributes under "attribute_data" may vary; there could be another attribute, say "third".
I am expecting the result in the format below:
_index _type _id source_Id attribute_data val updated_at
ID ID randomid 00000 first true 2000-08-08T07:51:14Z
ID ID randomid 00000 second true 2010-08-08T07:51:14Z
I tried the following approach.
val df = spark.read.json("sample.json")
val res = df.select("index","id","type","source.attribute_data.first.updated_at", "source.attribute_data.first.val", "source.link_data.source_id");
It just adds new columns, not rows, as shown below:
index id type updated_at val source_id
identity 100000 identity 2011 [true] 0011245
Try the following:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = spark.read.json("sample.json")
df.select($"id", $"index", $"source.link_data.source_Id".as("source_Id"),$"source.attribute_data.first.val".as("first"), explode($"source.attribute_data.second.val").as("second"), $"type")
.select($"id", $"index", $"source_Id", $"second", explode($"first"), $"type").show
Here you go with the solution. Feel free to ask if you need help understanding anything:
val data = spark.read.json("sample.json")
val readJsonDf = data.select($"index", $"type", $"id", $"source.link_data.source_id".as("source_id"), $"source.attribute_data.*")
readJsonDf.show()
Initial Output:
+--------+--------+------+---------+--------------------+--------------------+
| index| type| id|source_id| first| second|
+--------+--------+------+---------+--------------------+--------------------+
|identity|identity|100000| 0011245|[2011,WrappedArra...|[2010,WrappedArra...|
+--------+--------+------+---------+--------------------+--------------------+
Then I did the dynamic transformation using the following lines of code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def transposeColumnsToRows(df: DataFrame, constantCols: Seq[String]): DataFrame = {
  val (cols, types) = df.dtypes.filter { case (c, _) => !constantCols.contains(c) }.unzip
  // Check that the columns to be transposed into rows all have the same structure
  require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")
  val keyColsWithValues = explode(array(
    cols.map(c => struct(lit(c).alias("columnKey"), col(c).alias("value"))): _*
  ))
  df.select(constantCols.map(col(_)) :+ keyColsWithValues.alias("keyColsWithValues"): _*)
}

val newDf = transposeColumnsToRows(readJsonDf, Seq("index", "type", "id", "source_id"))
val requiredDf = newDf.select($"index", $"type", $"id", $"source_id",
  $"keyColsWithValues.columnKey".as("attribute_data"),
  $"keyColsWithValues.value.updated_at".as("updated_at"),
  $"keyColsWithValues.value.val".as("val"))
requiredDf.show()
Final Output:
+--------+--------+------+---------+--------------+----------+------+
|   index|    type|    id|source_id|attribute_data|updated_at|   val|
+--------+--------+------+---------+--------------+----------+------+
|identity|identity|100000|  0011245|         first|      2011|[true]|
|identity|identity|100000|  0011245|        second|      2010|[true]|
+--------+--------+------+---------+--------------+----------+------+
Hope this solves your issue!
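If you prefer not to build the explode(array(struct(...))) expression by hand, the same unpivot can also be written with the SQL stack() expression. A minimal sketch against readJsonDf above (note it lists the attribute names explicitly, so it is less dynamic than the function version):
// stack(2, ...) emits one row per (label, struct) pair, i.e. one row for 'first' and one for 'second'
val stacked = readJsonDf.selectExpr(
  "index", "type", "id", "source_id",
  "stack(2, 'first', first, 'second', second) as (attribute_data, attr)")
stacked.select($"index", $"type", $"id", $"source_id", $"attribute_data",
  $"attr.updated_at".as("updated_at"), $"attr.val".as("val")).show()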
Related
I have a JSON file which looks like this:
{
"tags":[
"Real_send",
"stopped"
],
"messages":{
"7c2e9284-993d-4eb4-ad6b-6a2bfcc51060":{
"channel":"channel 1",
"name":"Version 1",
"alert":"\ud83d\ude84 alert 1"
},
"c2cbd05c-5452-476c-bdc7-ac31ed3417f9":{
"channel":"channel 1",
"name":"name 1",
"type":"type 1"
},
"b869886f-0f9c-487f-8a43-abe3d6456678":{
"channel":"channel 2",
"name":"Version 2",
"alert":"\ud83d\ude84 alert 2"
}
}
}
I want the output to look like below
When I print the schema, I get the below schema from Spark:
StructType(List(
StructField(messages,
StructType(List(
StructField(7c2e9284-993d-4eb4-ad6b-6a2bfcc51060,
StructType(List(
StructField(alert,StringType,true),
StructField(channel,StringType,true),
StructField(name,StringType,true))),true),
StructField(b869886f-0f9c-487f-8a43-abe3d6456678,StructType(List(
StructField(alert,StringType,true),
StructField(channel,StringType,true),
StructField(name,StringType,true))),true),
StructField(c2cbd05c-5452-476c-bdc7-ac31ed3417f9,StructType(List(
StructField(channel,StringType,true),
StructField(name,StringType,true),
StructField(type,StringType,true))),true))),true),
StructField(tags,ArrayType(StringType,true),true)))
Basically 7c2e9284-993d-4eb4-ad6b-6a2bfcc51060 should be considered as my ID column
My code looks like:
cols_list_to_select_from_flattened = ['alert', 'channel', 'type', 'name']
df = df \
.select(
F.json_tuple(
F.col('messages'), *cols_list_to_select_from_flattened
)
.alias(*cols_list_to_select_from_flattened))
df.show(1, False)
Error message:
E pyspark.sql.utils.AnalysisException: cannot resolve 'json_tuple(`messages`, 'alert', 'channel', 'type', 'name')' due to data type mismatch: json_tuple requires that all arguments are strings;
E 'Project [json_tuple(messages#0, alert, channel, type, name) AS ArrayBuffer(alert, channel, type, name)]
E +- Relation[messages#0,tags#1] json
I also tried to list all keys like below
df.withColumn("map_json_column", F.posexplode_outer(F.col("messages"))).show()
But got this error:
E pyspark.sql.utils.AnalysisException: cannot resolve 'posexplode(`messages`)' due to data type mismatch: input to function explode should be array or map type, not struct<7c2e9284-993d-4eb4-ad6b-6a2bfcc51060:struct<alert:string,channel:string,name:string>,b869886f-0f9c-487f-8a43-abe3d6456678:struct<alert:string,channel:string,name:string>,c2cbd05c-5452-476c-bdc7-ac31ed3417f9:struct<channel:string,name:string,type:string>>;
E 'Project [messages#0, tags#1, generatorouter(posexplode(messages#0)) AS map_json_column#5]
E +- Relation[messages#0,tags#1] json
How can I get the desired output?
When reading JSON you can specify your own schema: instead of the messages column being a struct type, make it a map type, and then you can simply explode that column.
Here is a self-contained example with your data:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()
json_sample = """
{
"tags":[
"Real_send",
"stopped"
],
"messages":{
"7c2e9284-993d-4eb4-ad6b-6a2bfcc51060":{
"channel":"channel 1",
"name":"Version 1",
"alert":"lert 1"
},
"c2cbd05c-5452-476c-bdc7-ac31ed3417f9":{
"channel":"channel 1",
"name":"name 1",
"type":"type 1"
},
"b869886f-0f9c-487f-8a43-abe3d6456678":{
"channel":"channel 2",
"name":"Version 2",
"alert":" alert 2"
}
}
}
"""
data = spark.sparkContext.parallelize([json_sample])
cols_to_select = ['alert', 'channel', 'type', 'name']
# The schema of a message entry; only the columns
# that need to be selected are declared, and they
# must be nullable based on your data sample
message_schema = StructType([
StructField(col_name, StringType(), True) for col_name in cols_to_select
])
# the complete document schema
json_schema = StructType([
StructField("tags", StringType(), False),
StructField("messages", MapType(StringType(), message_schema, False) ,False),
])
# Read json and parse to specific schema
# Here instead of sample data you can use file path
df = spark.read.schema(json_schema).json(data)
# explode the map column and select the required columns
df = (
df
.select(F.explode(F.col("messages")))
.select(
F.col("key").alias("id"),
*[F.col(f"value.{col_name}").alias(col_name) for col_name in cols_to_select]
)
)
df.show(truncate=False)
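For anyone on the Scala API, a rough equivalent sketch of the same idea (messages declared as a MapType so the UUID keys become map keys, then exploded); the file path and SparkSession name here are assumptions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._  // assumes a SparkSession named spark

// Only the message fields we want to keep
val messageSchema = StructType(
  Seq("alert", "channel", "type", "name").map(name => StructField(name, StringType, nullable = true)))

val jsonSchema = StructType(Seq(
  StructField("tags", ArrayType(StringType, containsNull = true), nullable = true),
  StructField("messages", MapType(StringType, messageSchema, valueContainsNull = true), nullable = true)))

val parsed = spark.read.schema(jsonSchema).option("multiLine", true).json("path/to/file.json")

parsed.select(explode($"messages"))
  .select($"key".as("id"), $"value.alert", $"value.channel", $"value.type", $"value.name")
  .show(false)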
I have a Spark dataframe that looks like this:
I want to flatten the columns.
The result should look like this:
Data:
{
"header": {
"message-id": "ID:EL2-202103221753-77777777-88888-9999999999-1:2:1:1:1",
"reply-to": "queue://CaseProcess.v2",
"timestamp": "2021-03-22T20:07:27"
},
"properties": {
"property": [
{
"name": "ELIS_EXCEPTION_MSG",
"value": "The AWS Access Key Id you provided does not exist in our records"
},
{
"name": "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
"value": "1616458043704"
}
]
}
}
You should first rename the columns under header, then explode the properties.property array, then group by the renamed columns and pivot on the property name.
Here is an example that produces your wanted result:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import json
if __name__ == "__main__":
spark = SparkSession.builder.appName("Test").getOrCreate()
data = {
"header": {
"message-id": "ID:EL2-202103221753-77777777-88888-9999999999-1:2:1:1:1",
"reply-to": "queue://CaseProcess.v2",
"timestamp": "2021-03-22T20:07:27",
},
"properties": {
"property": [
{
"name": "ELIS_EXCEPTION_MSG",
"value": "The AWS Access Key Id you provided does not exist in our records",
},
{
"name": "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
"value": "1616458043704",
},
]
},
}
sc = spark.sparkContext
df = spark.read.json(sc.parallelize([json.dumps(data)]))
df = df.select(
F.col("header.message-id").alias("message-id"),
F.col("header.reply-to").alias("reply-to"),
F.col("header.timestamp").alias("timestamp"),
F.col("properties"),
)
df = df.withColumn("propertyexploded", F.explode("properties.property"))
df = df.withColumn("propertyname", F.col("propertyexploded")["name"])
df = df.withColumn("propertyvalue", F.col("propertyexploded")["value"])
df = (
df.groupBy("message-id", "reply-to", "timestamp")
.pivot("propertyname")
.agg(F.first("propertyvalue"))
)
df.printSchema()
df.show()
Result:
root
|-- message-id: string (nullable = true)
|-- reply-to: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- ELIS_EXCEPTION_MSG: string (nullable = true)
|-- ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS: string (nullable = true)
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
| message-id| reply-to| timestamp| ELIS_EXCEPTION_MSG|ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS|
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
|ID:EL2-2021032217...|queue://CaseProce...|2021-03-22T20:07:27|The AWS Access Ke...| 1616458043704|
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
Thanks Vlad. I tried your option and it was successful.
In your solution, the properties.property['name'] values are handled dynamically, and I liked that.
The only drawback is that the number of rows gets multiplied by the number of properties: when I had 3 rows, the exploded df had 36 rows. Of course, the pivot brings it back to 3 rows.
The problem I faced is that all the properties get repeated 12 times per row. That would have been fine if the properties were small, but I have 2 columns in the stack which can be up to 2K-20K in size, and some queues go over 100K rows. Repeating that over and over again seems like overkill for my process.
However, I found another solution with a different trade-off: the property names have to be hard-coded, but the row duplication is eliminated.
Here is what I ended up using:
XXX = df.rdd.flatMap(lambda x: [( x[1]["destination"].replace("queue://", ""),
x[1]["message-id"].replace("ID:", ""),
x[1]["delivery-mode"],
x[1]["expiration"],
x[1]["priority"],
x[1]["redelivered"],
x[1]["timestamp"],
y[0]["value"],
y[1]["value"],
y[2]["value"],
y[3]["value"],
y[4]["value"],
y[5]["value"],
y[6]["value"],
y[7]["value"],
y[8]["value"],
y[9]["value"],
y[10]["value"],
y[11]["value"],
x[0],
x[3],
x[4],
x[5]
) for y in x[2]
])\
.toDF(["queue",
"message_id",
"delivery_mode",
"expiration",
"priority",
"redelivered",
"timestamp",
"ELIS_EXCEPTION_MSG",
"ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
"ELIS_MESSAGE_RETRY_COUNT",
"ELIS_MESSAGE_ORIG_TIMESTAMP",
"ELIS_MDC_TRACER_ID",
"tracestate",
"ELIS_ROOT_CAUSE_EXCEPTION_MSG",
"traceparent",
"ELIS_MESSAGE_TYPE",
"ELIS_EXCEPTION_CLASS",
"newrelic",
"ELIS_EXCEPTION_TRACE",
"body",
"partition_0",
"partition_1",
"partition_2"
])
print(f"... type(XXX): {type(XXX)} | df.count(): {df.count()} | XXX.count(): {XXX.count()}")
output: ... type(XXX): <class 'pyspark.rdd.PipelinedRDD'> | df.count(): 3 | XXX.count(): 3
My column structure comes from ActiveMQ API extracts, which means it is consistent, so it is OK for my use case to hard-code the column names in the flattened_df.
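If the row multiplication from explode + pivot is the main concern but you still want the property names resolved dynamically, another middle ground (sketched here with the Scala API against the simplified sample from the answer above; needs Spark 2.4+ for map_from_entries) is to turn the name/value array into a map and pull out the keys you need, so the row count never changes:
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark

// Properties you actually want as columns (extend as needed)
val wanted = Seq("ELIS_EXCEPTION_MSG", "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS")

// One map per row, no explode, so rows are never duplicated
val withMap = df.withColumn("props", map_from_entries($"properties.property"))

val flat = withMap.select(
  Seq($"header.message-id".as("message-id"),
      $"header.reply-to".as("reply-to"),
      $"header.timestamp".as("timestamp")) ++
    wanted.map(name => $"props".getItem(name).as(name)): _*)
flat.show(truncate = false)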
I have a JSON file that looks like this:
{
"tags": [
{
"1": "NpProgressBarTag",
"2": "userPath",
"3": "screen",
"4": 6,
"12": 9,
"13": "buttonName",
"16": 0,
"17": 10,
"18": 5,
"19": 6,
"20": 1,
"35": 1,
"36": 1,
"37": 4,
"38": 0,
"39": "npChannelGuid",
"40": "npShowGuid",
"41": "npCategoryGuid",
"42": "npEpisodeGuid",
"43": "npAodEpisodeGuid",
"44": "npVodEpisodeGuid",
"45": "npLiveEventGuid",
"46": "npTeamGuid",
"47": "npLeagueGuid",
"48": "npStatus",
"50": 0,
"52": "gupId",
"54": "deviceID",
"55": 1,
"56": 0,
"57": "uiVersion",
"58": 1,
"59": "deviceOS",
"60": 1,
"61": 0,
"62": "channelLineupID",
"63": 2,
"64": "userProfile",
"65": "sessionId",
"66": "hitId",
"67": "actionTime",
"68": "seekTo",
"69": "seekFrom",
"70": "currentPosition"
}
]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
When I run this I get:
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do we create a df based on the contents of the "tags" key? All I need is to pull the data out of "tags" and apply a case class like this:
case class ProgLang (id: String, type: String )
I need to convert this JSON data into a dataframe with two columns, i.e. .toDF("id", "type").
Can anyone shed some light on this error?
You may modify the JSON using Circe.
Given that your values are sometimes Strings and other times Numbers, this was quite complex.
import io.circe._, io.circe.parser._, io.circe.generic.semiauto._
val json = """ ... """ // your JSON here.
val doc = parse(json).right.get
val mappedDoc = doc.hcursor.downField("tags").withFocus { array =>
array.mapArray { jsons =>
jsons.map { json =>
json.mapObject { o =>
o.mapValues { v =>
// Cast numbers to strings.
if (v.isString) v else Json.fromString(v.asNumber.get.toString)
}
}
}
}
}
final case class ProgLang(id: String, `type`: String )
final case class Tags(tags: List[Map[String, String]])
implicit val TagsDecoder: Decoder[Tags] = deriveDecoder
val tags = mappedDoc.top.get.as[Tags]
val data = for {
tag <- tags.right.get.tags
(id, _type) <- tag
} yield ProgLang(id, _type)
Now that you have a List of ProgLang, you may create a DataFrame directly from it, save it as a file with one JSON per line, save it as a CSV file, etc...
If the file is very big, you may use fs2 to stream it while transforming, it integrates nicely with Circe.
DISCLAIMER: I am far from being a "pro" with Circe, this seems over-complicated for doing something which seems like a "simple-task", probably there is a better / cleaner way of doing it (maybe using Optics?), but hey! it works! - anyways, if anyone knows a better way to solve this feel free to edit the question or provide yours.
val path = "some/path/to/jsonFile.json"
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json(path)
Try the following code if your JSON file is not very big:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("some/path/to/jsonFile.json").values)
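Once the file parses with either of the reads above, tags comes back as an array holding a single struct whose fields mix string and numeric types, so a possible pure-Spark sketch (no Circe) for getting the two-column (id, type) frame is to unpivot that struct and cast every value to string:
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark

// Expand the single element of the tags array into its own columns
val tag = df.select($"tags"(0).as("tag")).select($"tag.*")

// Build one (id, type) struct per field; cast to string because the sample
// mixes strings and numbers
val kv = explode(array(
  tag.columns.map(c => struct(lit(c).as("id"), col(c).cast("string").as("type"))): _*))

val result = tag.select(kv.as("kv")).select($"kv.id", $"kv.type")
result.show(false)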
I'm reading multiple JSON files from a directory; each JSON has multiple items in a 'cars.items' array. I'm trying to explode the array and merge the discrete values of each item into one dataframe.
A JSON file looks like:
{
"cars": {
"items":
[
{
"latitude": 42.0001,
"longitude": 19.0001,
"name": "Alex"
},
{
"latitude": 42.0002,
"longitude": 19.0002,
"name": "Berta"
},
{
"latitude": 42.0003,
"longitude": 19.0003,
"name": "Chris"
},
{
"latitude": 42.0004,
"longitude": 19.0004,
"name": "Diana"
}
]
}
}
My approaches to exploding and merging the values into just one dataframe are:
// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// To sqlContext to DataFrame
val jsonDF = jsonData.toDF()
/* Approach 1 */
// User-defined function to 'zip' two columns
val zip = udf((xs: Seq[Double], ys: Seq[Double]) => xs.zip(ys))
jsonDF.withColumn("vars", explode(zip($"cars.items.latitude", $"cars.items.longitude"))).select($"cars.items.name", $"vars._1".alias("varA"), $"vars._2".alias("varB"))
/* Apporach 2 */
val df = jsonData.select($"cars.items.name", $"cars.items.latitude", $"cars.items.longitude").toDF("name", "latitude", "longitude")
val df1 = df.select(explode(df("name")).alias("name"), df("latitude"), df("longitude"))
val df2 = df1.select(df1("name").alias("name"), explode(df1("latitude")).alias("latitude"), df1("longitude"))
val df3 = df2.select(df2("name"), df2("latitude"), explode(df2("longitude")).alias("longitude"))
As you can see, the result of Approach 1 is just a dataframe of two discrete 'merged' parameters, like:
+--------------------+---------+---------+
| name| varA| varB|
+--------------------+---------+---------+
|[Leo, Britta, Gor...|48.161079|11.556778|
|[Leo, Britta, Gor...|48.124666|11.617682|
|[Leo, Britta, Gor...|48.352043|11.788091|
|[Leo, Britta, Gor...| 48.25184|11.636337|
The result for Approach 2 is as follows:
+----+---------+---------+
|name| latitude|longitude|
+----+---------+---------+
| Leo|48.161079|11.556778|
| Leo|48.161079|11.617682|
| Leo|48.161079|11.788091|
| Leo|48.161079|11.636337|
| Leo|48.161079|11.560595|
| Leo|48.161079|11.788632|
(The result is a mapping of each 'name' with every 'latitude' and every 'longitude'.)
The result should be as follows:
+--------------------+---------+---------+
| name| varA| varB|
+--------------------+---------+---------+
|Leo |48.161079|11.556778|
|Britta |48.124666|11.617682|
|Gorch |48.352043|11.788091|
Do you know how to read the files and split and merge the values so that each row is just one object?
Thank you very much for your help!
To get the expected result you can try the following approach:
// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// To sqlContext to DataFrame
val jsonDF = jsonData.toDF()
// Approach
val df1 = jsonDF.select(explode(jsonDF("cars.items")).alias("items"))
val df2 = df1.select("items.name", "items.latitude", "items.longitude")
The above approach will provide you following result:
+-----+--------+---------+
| name|latitude|longitude|
+-----+--------+---------+
| Alex| 42.0001| 19.0001|
|Berta| 42.0002| 19.0002|
|Chris| 42.0003| 19.0003|
|Diana| 42.0004| 19.0004|
+-----+--------+---------+
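As a small follow-up, since each element of cars.items is a struct you can also pull all of its fields at once after the explode, which avoids listing them by hand:
val flattened = jsonDF
  .select(explode($"cars.items").as("item"))
  .select($"item.*")  // expands name, latitude, longitude in one go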
After reading a JSON result from a web service response:
val jsonResult: JsValue = Json.parse(response.body)
Containing content something like:
{
result: [
["Name 1", "Row1 Val1", "Row1 Val2"],
["Name 2", "Row2 Val1", "Row2 Val2"]
]
}
How can I efficiently map the contents of the result array in the JSON with a list (or something similar) like:
val keys = List("Name", "Val1", "Val2")
To get an array of hashmaps?
Something like this?
This solution is functional and handles None/Failure cases "properly" (by returning a None).
val j = JSON.parseFull( json ).asInstanceOf[ Option[ Map[ String, List[ List[ String ] ] ] ] ]
val res = j.map { m ⇒
val r = m get "result"
r.map { ll ⇒
ll.foldRight( List(): List[ Map[ String, String ] ] ) { ( l, acc ) ⇒
Map( ( "Name" -> l( 0 ) ), ( "Val1" -> l( 1 ) ), ( "Val2" -> l( 2 ) ) ) :: acc
}
}.getOrElse(None)
}.getOrElse(None)
Note 1: I had to put double quotes around result in the JSON String to get the JSON parser to work
Note 2: the code could look nicer using more "monadic" sugar such as for comprehensions or using applicative functors
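Since the question already has the response parsed with Play JSON (Json.parse), a hedged alternative sketch is to read the result array there and zip each row with the key list:
import play.api.libs.json._

val keys = List("Name", "Val1", "Val2")

// Each inner array becomes a Map by zipping it with the key list; rows shorter
// than the key list simply produce fewer entries
val maps: List[Map[String, String]] =
  (jsonResult \ "result").as[List[List[String]]].map(row => keys.zip(row).toMap)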