Spark SQL Flatten JSON

I have a JSON which looks like this
{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley","year":2012}]}
I want to store the output in a CSV file like this:
Michael,{"sname":"stanford", "year":2010}
Michael,{"sname":"berkeley", "year":2012}
I have tried the following:
val people = sqlContext.read.json("people.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))
The above code does not give schools_flat as a JSON string.
Any ideas on how to get the expected output?
Thanks

You need to specify the schema explicitly to read the JSON file in the desired way.
In this case it would be like this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
case class json_schema_class(cities: String, name: String, schools: Array[String])

// schools is declared as Array[String], so each element of the exploded
// column stays a raw JSON string rather than being parsed into a struct
val json_schema = ScalaReflection.schemaFor[json_schema_class].dataType.asInstanceOf[StructType]
val people = sqlContext.read.schema(json_schema).json("people.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))
The 'flattened' dataframe is like this:
+-------+--------------------+
| name| schools_flat|
+-------+--------------------+
|Michael|{"sname":"stanfor...|
|Michael|{"sname":"berkele...|
+-------+--------------------+
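To get from here to the CSV the question asks for, one more step is needed. A minimal sketch, assuming a hypothetical output directory "people_flat_csv"; note that Spark's CSV writer will quote and escape the JSON field:

// Under the Array[String] schema above, schools_flat already holds the
// raw JSON string, so both columns can be written straight to CSV.
// "people_flat_csv" is a hypothetical output directory.
flattened.write.csv("people_flat_csv")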

Related

Convert struct to array in Spark data frame

I have a dataframe in spark like below.
{"emp_id":1,"emp_name":"John","cust_id":"c1","cust_detail":[{"name":"abc","acc_no":123,"mobile":000},{"name":"abc","acc_no":123,"mobile":111},{"name":"abc","acc_no":123,"mobile":222}]}
I am looking for the output like below.
{"emp_id":1,"emp_name":"John","cust_id":"c1","cust_detail":[{"name":["abc"],"acc_no":[123],"mobile":[000,123,222]}
This should do what you want: first explode the column, then aggregate back.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[1]")
  .appName("learn")
  .getOrCreate()
val inputdf = spark.read.option("multiline", "false").json("C:\\Users\\User\\OneDrive\\Desktop\\source_file.txt")
val newdf1 = inputdf.withColumn("cust_detail_exploded", explode(col("cust_detail"))).drop("cust_detail")
val newdf2 = newdf1.select("cust_id", "emp_name", "emp_id", "cust_detail_exploded.mobile", "cust_detail_exploded.acc_no", "cust_detail_exploded.name")
val newdf3 = newdf2.groupBy("cust_id").agg(array(struct(
  collect_set(col("mobile")).as("mobile"),
  collect_set(col("acc_no")).as("acc_no"),
  collect_set(col("name")).as("name")
)).as("cust_detail"))
newdf3.printSchema()
newdf3.write.json("C:\\Users\\User\\OneDrive\\Desktop\\newww.txt")
Output:
{"cust_id":"c1","cust_detail":[{"mobile":["111","000","222"],"acc_no":["123"],"name":["abc"]}]}

JSON string written to Kafka using Spark is not converted properly on reading

I read a .csv file to create a data frame and I want to write the data to a Kafka topic. The code is the following:
df = spark.read.format("csv").option("header", "true").load(f'{file_location}')
kafka_df = df.selectExpr("to_json(struct(*)) AS value").selectExpr("CAST(value AS STRING)")
kafka_df.show(truncate=False)
And the data frame looks like this:
value
"{""id"":""d215e9f1-4d0c-42da-8f65-1f4ae72077b3"",""latitude"":""-63.571457254062715"",""longitude"":""-155.7055842710919""}"
"{""id"":""ca3d75b3-86e3-438f-b74f-c690e875ba52"",""latitude"":""-53.36506636464281"",""longitude"":""30.069167069917597""}"
"{""id"":""29e66862-9248-4af7-9126-6880ceb3b45f"",""latitude"":""-23.767505281795835"",""longitude"":""174.593140405442""}"
"{""id"":""451a7e21-6d5e-42c3-85a8-13c740a058a9"",""latitude"":""13.02054867061598"",""longitude"":""20.328402498420786""}"
"{""id"":""09d6c11d-7aae-4d17-8cd8-183157794893"",""latitude"":""-81.48976715040848"",""longitude"":""1.1995769642056189""}"
"{""id"":""393e8760-ef40-482a-a039-d263af3379ba"",""latitude"":""-71.73949722379649"",""longitude"":""112.59922770487054""}"
"{""id"":""d6db8fcf-ee83-41cf-9ec2-5c2909c18534"",""latitude"":""-4.034680969008576"",""longitude"":""60.59645511854336""}"
After I wrote it to Kafka, I want to read it and transform the binary data from the "value" column back to a JSON string, but the result is that the value contains only the id, not the whole string. Any idea why?
from pyspark.sql import functions as F
df = consume_from_event_hub(topic, bootstrap_servers, config, consumer_group)
string_df = df.select(F.col("value").cast("string"))
string_df.display()
value
794541bc-30e6-4c16-9cd0-3c5c8995a3a4
20ea5b50-0baa-47e3-b921-f9a3ac8873e2
598d2fc1-c919-4498-9226-dd5749d92fc5
86cd5b2b-1c57-466a-a3c8-721811ab6959
807de968-c070-4b8b-86f6-00a865474c35
e708789c-e877-44b8-9504-86fd9a20ef91
9133a888-2e8d-4a5a-87ce-4a53e63b67fc
cd5e3e0d-8b02-45ee-8634-7e056d49bf3b
The CSV format is this:
id,latitude,longitude
bd6d98e1-d1da-4f41-94ba-8dbd8c8fce42,-86.06318155350924,-108.14300138138589
c39e84c6-8d7b-4cc5-b925-68a5ea406d52,74.20752175171859,-129.9453606091319
011e5fb8-6ab7-4ee9-97bb-acafc2c71e15,19.302250885973592,-103.2154291337162
You need to remove selectExpr("CAST(value AS STRING)"), since to_json already returns a string column:
from pyspark.sql.functions import col, to_json, struct
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(f'{file_location}')
kafka_df = df.select(to_json(struct(col("*"))).alias("value"))
kafka_df.show(truncate=False)
I'm not sure what's wrong with the consumer. That should have worked, unless consume_from_event_hub does something specifically to extract the id column.
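For reference, a minimal Kafka round trip with Spark's Kafka source and sink looks like the sketch below (shown in Scala; the PySpark calls have the same shape). The broker address broker:9092 and the topic name positions are hypothetical placeholders.

import org.apache.spark.sql.functions.{col, struct, to_json}

// Producer side: to_json already yields a string column named "value".
df.select(to_json(struct(col("*"))).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("topic", "positions")                     // hypothetical topic
  .save()

// Consumer side: Kafka delivers "value" as binary, so cast it back to a string.
val roundTrip = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "positions")
  .load()
  .select(col("value").cast("string"))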

How to create multiple DataFrames from a multiple lists in Scala Spark

I'm trying to create multiple DataFrames from the two lists below,
val paths = ListBuffer("s3://abc_xyz_tableA.json",
"s3://def_xyz_tableA.json",
"s3://abc_xyz_tableB.json",
"s3://def_xyz_tableB.json",
"s3://abc_xyz_tableC.json",....)
val tableNames = ListBuffer("tableA","tableB","tableC","tableD",....)
I want to create a separate DataFrame per table name by grouping together all the S3 paths that end with the same table name, since each table has its own schema.
So, for example, if the tables and their related paths are brought together:
"tableADF" will have all the data from the paths "s3://abc_xyz_tableA.json" and "s3://def_xyz_tableA.json", as they have "tableA" in the path;
"tableBDF" will have all the data from the paths "s3://abc_xyz_tableB.json" and "s3://def_xyz_tableB.json", as they have "tableB" in the path;
and so on; there can be many table names and paths.
I've tried different approaches but haven't been successful yet.
Any leads in achieving the desired solution will be of great help. Thanks!
Using the built-in input_file_name() function, you can filter on the file names to get the DataFrame for each file or file pattern:
import org.apache.spark.sql.functions._
import spark.implicits._
var df = spark.read.format("json").load("s3://data/*.json")
df = df.withColumn("input_file", input_file_name())

val tableADF = df.filter($"input_file".endsWith("tableA.json"))
val tableBDF = df.filter($"input_file".endsWith("tableB.json"))
If the list of file-name postfixes is pretty long, you can use something like the code below; the explanation is inline in the comments.
import org.apache.spark.sql.functions._

object DFByFileName {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // helper from this answer that builds the SparkSession
    import spark.implicits._

    // Load your JSON data
    var df = spark.read.format("json").load("s3://data/*.json")

    // Add a column with the source file name
    df = df.withColumn("input_file", input_file_name())

    // Extract the unique file postfixes from the file names into a list
    val fileGroupList = df.select("input_file").map(row => {
      val fileName = row.getString(0)
      val index1 = fileName.lastIndexOf("_")
      val index2 = fileName.lastIndexOf(".")
      fileName.substring(index1 + 1, index2)
    }).collect().distinct // .distinct so each postfix appears only once

    // Map each file group name to the DataFrame for that group
    fileGroupList.map(fileGroupName => {
      df.filter($"input_file".endsWith(s"${fileGroupName}.json"))
      // perform dataframe operations
    })
  }
}
Check the code below. The final result type is
scala.collection.immutable.Map[String,org.apache.spark.sql.DataFrame] = Map(tableBDF -> [...], tableADF -> [...], tableCDF -> [...]), where ... is your column list.
paths
  .map(path => (s"${path.split("_").last.split("\\.json").head}DF", path)) // parse the file name: extract the table name and pair it with the path
  .groupBy(_._1) // group paths that share the same table name
  .map(p => (p._1 -> p._2.map(_._2))).par // combine the paths for each table into a list; .par executes the subsequent steps in parallel
  .map(mp => {
    (
      mp._1, // table name
      mp._2.par // load the files of the same table in parallel
        .map(spark.read.json(_)) // load each file from S3
        .reduce(_ union _) // union when the same table has multiple files
    )
  })
  .seq // back to a sequential immutable Map, matching the result type above
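Assuming the pipeline above is bound to a name, say tablesByName (a hypothetical binding), each table's DataFrame can then be looked up by its derived key:

// tablesByName is a hypothetical binding of the Map produced above.
tablesByName("tableADF").printSchema()
tablesByName("tableBDF").show(5)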

Converting mongoengine objects to JSON

I tried to fetch data from MongoDB using mongoengine with Flask. The query works perfectly; the problem is that when I convert the query result into JSON, it shows only the field names.
Here is my code:
view.py
from model import Users
result = Users.objects()
print(dumps(result))
model.py
from mongoengine import DynamicDocument, StringField

class Users(DynamicDocument):
    meta = {'collection': 'users'}
    user_name = StringField()
    phone = StringField()
Output:
[["id", "user_name", "phone"], ["id", "user_name", "phone"]]
Why does it show only the field names?
Your query returns a queryset. Use the .to_json() method to convert it.
Depending on what you need from there, you may want to use something like json.loads() to get a python dictionary.
For example:
from model import Users
# This returns <class 'mongoengine.queryset.queryset.QuerySet'>
q_set = Users.objects()
json_data = q_set.to_json()
# You might also find it useful to create python dictionaries
import json
dicts = json.loads(json_data)

How to get differences between two JSONs?

I want to compare 2 JSON and get all the differences between them in Scala. For example, I would like to compare:
{"a":"aa", "b": "bb", "c":"cc" }
and
{"c":"cc", "a":"aa", "d":"dd"}
I'd like to get b and d.
If it isn't a restriction, you can use json4s (http://json4s.org/); it has a nice diff feature. Here is an example based on the question:
import org.json4s._
import org.json4s.native.JsonMethods._
val json1 = parse("""{"a":"aa", "b":"bb", "c":"cc"}""")
val json2 = parse("""{"c":"cc", "a":"aa", "d":"dd"}""")
val Diff(changed, added, deleted) = json1 diff json2
It will return:
changed: org.json4s.JsonAST.JValue = JNothing
added: org.json4s.JsonAST.JValue = JObject(List((d,JString(dd))))
deleted: org.json4s.JsonAST.JValue = JObject(List((b,JString(bb))))
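If only the differing field names are needed ("b and d" in the question), a small follow-up sketch can collect them from the added and deleted parts:

// Collect the top-level field names that were added or deleted.
val diffKeys = (added merge deleted) match {
  case JObject(fields) => fields.map(_._1)
  case _               => Nil
}
// diffKeys: List[String] = List(d, b)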
Finally, I used JSONassert, which does the same thing.
For example,
String expected = "{id:1,name:\"Joe\",friends:[{id:2,name:\"Pat\",pets:[\"dog\"]},{id:3,name:\"Sue\",pets:[\"bird\",\"fish\"]}],pets:[]}";
String actual = "{id:1,name:\"Joe\",friends:[{id:2,name:\"Pat\",pets:[\"dog\"]},{id:3,name:\"Sue\",pets:[\"cat\",\"fish\"]}],pets:[]}";
JSONAssert.assertEquals(expected, actual, false);
It reports:
friends[id=3].pets[]: Expected bird, but not found ; friends[id=3].pets[]: Contains cat, but not expected
source: http://jsonassert.skyscreamer.org/