How to convert WrappedArray to string in spark? - json

I have a JSON file that contains a nested array, with the schema shown below:
| | |-- coordinates: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: array (containsNull = true)
| | | | | |-- element: array (containsNull = true)
| | | | | | |-- element: long (containsNull = true)
I used Spark to read the JSON and exploded the array:
explode(col("list_of_features.geometry.coordinates"))
which returns values like this:
WrappedArray(WrappedArray(WrappedArray(1271700, 6404100), WrappedArray(1271700, 6404200), WrappedArray(1271600, 6404200), WrappedArray(1271600, 6404300),....
But the original input has no WrappedArray; it looks something like this:
[[[[1271700,6404100],[1271700, 6404200],[1271600, 6404200]
The ultimate aim is to store the coordinates without WrappedArray (maybe as a String) in a CSV file for Hive to read.
After explode, is there any way to get just the coordinates enclosed in proper square brackets?
Or can I use replace to strip the WrappedArray text from the values in the RDD?

You can use a UDF to flatten the WrappedArray and turn it into a String:
import org.apache.spark.sql.functions.udf
// Flatten all four nesting levels and join the numbers with commas
val concatArray = udf((value: Seq[Seq[Seq[Seq[Long]]]]) => {
  value.flatten.flatten.flatten.mkString(",")
})
Now use the UDF to create or replace the column:
df1.withColumn("coordinates", concatArray($"coordinates"))
This gives you a comma-separated string in place of the WrappedArray.
UPDATE: If you want the string in the same bracketed format, you can do:
// Wrap every nesting level in brackets; the final mkString turns the outer Seq into a single String
val concatArray = udf((value: Seq[Seq[Seq[Seq[Long]]]]) => {
  value.map(_.map(_.map(_.mkString("[", ",", "]")).mkString("[", "", "]")).mkString("[", "", "]")).mkString("[", "", "]")
})
Output:
[[[[1271700,6404100][1271700,6404200][1271600,6404200]]]]
Hope this helps!
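For comparison, here is a plain-Python sketch (no Spark involved) of the two string formats discussed above: a flat comma-separated list and a fully bracketed string. The sample coordinates value and the helper names are made up for illustration; json.dumps is used only because a nested list of integers happens to serialize to the bracketed form shown in the input.

```python
import json

# Hypothetical sample mirroring the four-level nesting in the question
coordinates = [[[[1271700, 6404100], [1271700, 6404200], [1271600, 6404200]]]]

def flatten(value):
    """Yield the leaf numbers of an arbitrarily nested list, left to right."""
    for item in value:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def to_flat_string(value):
    """Join all leaf numbers with commas, dropping every bracket."""
    return ",".join(str(n) for n in flatten(value))

def to_bracketed_string(value):
    """Keep the nesting; separators=(",", ":") drops the spaces json.dumps adds by default."""
    return json.dumps(value, separators=(",", ":"))

flat = to_flat_string(coordinates)
bracketed = to_bracketed_string(coordinates)
print(flat)
print(bracketed)
```

Unlike the bracketed UDF output above, this version keeps the commas between the inner arrays, so the string can be parsed back with json.loads. Either way the value contains commas, so it has to be quoted when written to a CSV file.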

Related

Transform a list of JSON string to a list of dict in Pyspark

I'm struggling to transform a list of JSON strings into a list of dicts in PySpark without using a UDF or the RDD API.
I have this kind of dataframe:
Key | JSON_string
123456 | ["""{"Zipcode":704,"ZipCodeType":"STA"}""","""{"City":"PARC","State":"PR"}"""]
789123 | ["""{"Zipcode":7,"ZipCodeType":"AZA"}""","""{"City":"PRE","State":"XY"}"""]
How can I transform col(JSON_string) into [{"Zipcode":704,"ZipCodeType":"STA"},{"City":"PARC","State":"PR"}] using built-in PySpark functions?
I tried many functions such as create_map, collect_list, from_json, to_json, explode, json.loads and json.dump, but could not get the expected result.
Thank you for your help.
Explode the JSON_string column, parse each element as JSON, then group by Key again.
from pyspark.sql import functions as f
# One row per JSON string
df = df.withColumn('JSON_string', f.explode('JSON_string'))
# Infer a single schema that covers every JSON string
schema = spark.read.json(df.rdd.map(lambda r: r.JSON_string)).schema
df_result = df.withColumn('JSON', f.from_json('JSON_string', schema)) \
    .drop('JSON_string') \
    .groupBy('Key') \
    .agg(f.collect_list('JSON').alias('JSON'))
df_result.show(truncate=False)
df_result.printSchema()
+------+------------------------------------------------+
|Key |JSON |
+------+------------------------------------------------+
|123456|[{null, null, STA, 704}, {PARC, PR, null, null}]|
|789123|[{null, null, AZA, 7}, {PRE, XY, null, null}] |
+------+------------------------------------------------+
root
|-- Key: long (nullable = true)
|-- JSON: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- City: string (nullable = true)
| | |-- State: string (nullable = true)
| | |-- ZipCodeType: string (nullable = true)
| | |-- Zipcode: long (nullable = true)
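The nulls in the result come from the merged schema: every parsed object gets all four fields, and the ones missing from its source string are null. A plain-Python sketch of that effect follows. It only imitates the outcome and is not how Spark implements it; rows and parse_group are illustrative names.

```python
import json

# Sample data from the question, keyed by the Key column
rows = {
    123456: ['{"Zipcode":704,"ZipCodeType":"STA"}', '{"City":"PARC","State":"PR"}'],
    789123: ['{"Zipcode":7,"ZipCodeType":"AZA"}', '{"City":"PRE","State":"XY"}'],
}

# Union of all keys seen in any JSON string, sorted to match the printed schema order
fields = sorted({key for strings in rows.values() for s in strings for key in json.loads(s)})

def parse_group(strings):
    """Parse each JSON string and fill fields absent from it with None."""
    return [{name: obj.get(name) for name in fields} for obj in map(json.loads, strings)]

result = {key: parse_group(strings) for key, strings in rows.items()}
print(fields)
print(result[123456])
```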

PySpark, parse json given deep nested schema

I have a pyspark dataframe where, for each row, there is a column which is a json string.
+----------------------------------------------------------------------------+
|data |
+----------------------------------------------------------------------------+
|{"student":{"name":"Bob","surname":"Smith","age":18},"scholarship":true} |
|{"student":{"name":"Adam","surname":"Smith","age":"23"},"scholarship":false}|
+----------------------------------------------------------------------------+
I want to explode these JSON strings so that they comply with the following schema:
root
|-- scholarship: boolean (nullable = true)
|-- student: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
| |-- surname: string (nullable = true)
So, my solution is the following:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import BooleanType, LongType, StringType, StructField, StructType
my_schema = StructType([
    StructField("scholarship", BooleanType(), True),
    StructField(
        "student",
        StructType([
            StructField("age", LongType(), True),
            StructField("name", StringType(), True),
            StructField("surname", StringType(), True)
        ]),
        True
    )
])
parsed_df = my_df.withColumn("data", from_json(col("data"), my_schema))
This way, parsed_df looks like this:
+------------------------+
|data |
+------------------------+
|{true, {18, Bob, Smith}}|
|{false, null} |
+------------------------+
Instead, I would like an output as:
+----------------------------+
|data |
+----------------------------+
|{true, {18, Bob, Smith}} |
|{false, {null, Adam, Smith}}|
+----------------------------+
Is there any option in the from_json method or any alternative solution to reach this result?
I should add that I cannot use Databricks, and that (unlike in the example) in the real business case I don't define the schema myself; it is passed to me every time. My question is more general: given a Spark schema, is there any way to parse a JSON string column in a dataframe, even when the JSON is deeply nested?
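To make the desired behaviour concrete, here is a plain-Python sketch in which each leaf is checked against the schema separately, so a mismatched value nulls only its own field. This is not a from_json option. The nested-dict schema is a hypothetical stand-in for the StructType, and coerce is an illustrative helper that would still have to be wrapped in a UDF (or similar) to run on a dataframe.

```python
import json

# Hypothetical stand-in for the StructType: field name -> Python type or nested dict
schema = {
    "scholarship": bool,
    "student": {"age": int, "name": str, "surname": str},
}

def coerce(value, spec):
    """Return value if it matches spec, else None; recurse into nested structs."""
    if isinstance(spec, dict):
        if not isinstance(value, dict):
            return None
        return {name: coerce(value.get(name), sub) for name, sub in spec.items()}
    # bool is a subclass of int in Python, so reject True/False for int fields
    if spec is int and isinstance(value, bool):
        return None
    return value if isinstance(value, spec) else None

rows = [
    '{"student":{"name":"Bob","surname":"Smith","age":18},"scholarship":true}',
    '{"student":{"name":"Adam","surname":"Smith","age":"23"},"scholarship":false}',
]
parsed = [coerce(json.loads(row), schema) for row in rows]
for record in parsed:
    print(record)
```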

Extract array from list of json strings using Spark

I have a column in my data frame which contains a list of JSON objects, but its type is String. I need to run explode on this column, so first I need to convert it into a list. I couldn't find many references for this use case.
Sample data:
columnName: "[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}, {...}]"
This is what the data looks like. The fields are not fixed (index 0 might hold a JSON object with some fields while index 1 holds one with other fields). The list can also contain further nested JSON objects or extra fields. I am currently using this:
"""explode(split(regexp_replace(regexp_replace(colName, '(\\\},)','}},'), '(\\\[|\\\])',''), "},")) as colName"""
Here I replace "}," with "}},", remove "[" and "]", and then split on "},". This approach doesn't work because the objects can contain nested JSON.
How can I extract the array from the string?
You can try this way:
// Initial DataFrame
df.show(false)
+----------------------------------------------------------------------+
|columnName |
+----------------------------------------------------------------------+
|[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]|
+----------------------------------------------------------------------+
df.printSchema()
root
|-- columnName: string (nullable = true)
// toArray is a user-defined function that parses a string holding an array of JSON objects
import org.apache.spark.sql.functions.{col, explode, from_json, udf}
import org.json.JSONArray
val toArray = udf { (data: String) =>
  val jsonArray = new JSONArray(data)
  // Serialize each element back to its own JSON string
  (0 until jsonArray.length).map(i => jsonArray.getJSONObject(i).toString).toArray
}
// Using the udf and exploding the resultant array
val df1 = df.withColumn("columnName",explode(toArray(col("columnName"))))
df1.show(false)
+-----------------------------------------------------+
|columnName |
+-----------------------------------------------------+
|{"other":7,"name":"a","info":{"grade":"b","age":"1"}}|
|{"random":"x"} |
+-----------------------------------------------------+
df1.printSchema()
root
|-- columnName: string (nullable = true)
// Parsing the json string by obtaining the schema dynamically
val schema = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).schema
val df2 = df1.withColumn("columnName",from_json(col("columnName"),schema))
df2.show(false)
+---------------+
|columnName |
+---------------+
|[[1, b], a, 7,]|
|[,,, x] |
+---------------+
df2.printSchema()
root
|-- columnName: struct (nullable = true)
| |-- info: struct (nullable = true)
| | |-- age: string (nullable = true)
| | |-- grade: string (nullable = true)
| |-- name: string (nullable = true)
| |-- other: long (nullable = true)
| |-- random: string (nullable = true)
// Extracting all the fields from the json
df2.select(col("columnName.*")).show(false)
+------+----+-----+------+
|info |name|other|random|
+------+----+-----+------+
|[1, b]|a |7 |null |
|null |null|null |x |
+------+----+-----+------+
Edit:
If you can use the get_json_object function, you can try this instead:
import org.apache.spark.sql.functions.get_json_object
// Get the list of columns dynamically
val columns = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).columns
// Build one get_json_object expression per top-level field
val extractColumns = columns.map(c => get_json_object(col("columnName"), "$." + c).as(c))
df1.select(extractColumns: _*).show(false)
+-----------------------+----+-----+------+
|info |name|other|random|
+-----------------------+----+-----+------+
|{"grade":"b","age":"1"}|a |7 |null |
|null |null|null |x |
+-----------------------+----+-----+------+
Please note that the info column is not of struct type here. You may have to extract the fields of the nested JSON in a similar way.
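// Alternative: turn the string into a Dataset[String] and let spark.read.json infer the schema
import spark.implicits._ // needed for toDS()
val testString = """[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]"""
val ds = Seq(testString).toDS()
spark.read.json(ds)
  .select("info.age", "info.grade", "name", "other", "random")
  .show(10, false)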
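A JSON parser works where the regex split fails because it tracks nesting instead of looking for "}," anywhere in the text. A plain-Python sketch of what the toArray UDF above does (the helper name split_json_array is made up; only the standard json module is used):

```python
import json

def split_json_array(text):
    """Parse a JSON array string and return each element re-serialized as its own JSON string."""
    return [json.dumps(obj, separators=(",", ":")) for obj in json.loads(text)]

sample = '[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]'
parts = split_json_array(sample)
for part in parts:
    print(part)
```

The nested info object stays inside the first element because the parser, not a delimiter search, decides where each element ends.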

Reading Nested JSON via Spark SQL - [AnalysisException] cannot resolve Column

I have JSON data like this:
{
  "parent": [
    {
      "prop1": 1.0,
      "prop2": "C",
      "children": [
        {
          "child_prop1": [
            "3026"
          ]
        }
      ]
    }
  ]
}
After reading the data with Spark I get the following schema:
val df = spark.read.json("test.json")
df.printSchema
root
|-- parent: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- children: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- child_prop1: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | |-- prop1: double (nullable = true)
| | |-- prop2: string (nullable = true)
Now, I want to select child_prop1 from df. But when I try to select it I get org.apache.spark.sql.AnalysisException. Something like this:
df.select("parent.children.child_prop1")
org.apache.spark.sql.AnalysisException: cannot resolve '`parent`.`children`['child_prop1']' due to data type mismatch: argument 2 requires integral type, however, ''child_prop1'' is of string type.;;
'Project [parent#60.children[child_prop1] AS child_prop1#63]
+- Relation[parent#60] json
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
... 48 elided
Although, when I select only children from df it works fine.
df.select("parent.children").show(false)
+------------------------------------+
|children |
+------------------------------------+
|[WrappedArray([WrappedArray(3026)])]|
+------------------------------------+
I cannot understand why it throws an exception even though the column is present in the dataframe.
Any help is appreciated !
Your JSON is valid, and I don't think you need to change your input data.
Use explode to get the data:
import org.apache.spark.sql.functions.explode
val data = spark.read.json("src/test/java/data.json")
val child = data.select(explode(data("parent.children"))).toDF("children")
child.select(explode(child("children.child_prop1"))).toDF("child_prop1").show()
If you can change the input data, you can follow @ramesh's suggestion.
If you look at the schema, child_prop1 sits inside a nested array (children) within the root array parent. So we need to specify the position of child_prop1 within those arrays, and that is what the error is asking you to do.
Converting your JSON format should do the trick.
Changing the JSON to
{"parent":{"prop1":1.0,"prop2":"C","children":{"child_prop1":["3026"]}}}
and applying
df.select("parent.children.child_prop1").show(false)
gives the output
+-----------+
|child_prop1|
+-----------+
|[3026] |
+-----------+
Changing the JSON to
{"parent":{"prop1":1.0,"prop2":"C","children":[{"child_prop1":["3026"]}]}}
and applying
df.select("parent.children.child_prop1").show(false)
results in
+--------------------+
|child_prop1 |
+--------------------+
|[WrappedArray(3026)]|
+--------------------+
I hope the answer helps
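The two explode calls in the first answer can be pictured as two rounds of unpacking nested lists. A plain-Python sketch using the JSON from the question (no Spark; the variable names are illustrative):

```python
import json

doc = json.loads('{"parent":[{"prop1":1.0,"prop2":"C","children":[{"child_prop1":["3026"]}]}]}')

# parent.children is a list of children arrays (one per parent element);
# the first explode turns each of those arrays into its own row
rows_after_first = [parent["children"] for parent in doc["parent"]]

# children.child_prop1 is, per row, a list of child_prop1 arrays;
# the second explode turns each child_prop1 array into its own row
rows_after_second = [child["child_prop1"] for children in rows_after_first for child in children]

print(rows_after_first)
print(rows_after_second)
```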

How to parse nested JSON objects in spark sql?

I have a schema as shown below. How can I parse the nested objects?
root
|-- apps: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- appName: string (nullable = true)
| | |-- appPackage: string (nullable = true)
| | |-- Ratings: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- date: string (nullable = true)
| | | | |-- rating: long (nullable = true)
|-- id: string (nullable = true)
Assuming you read in a JSON file and print the schema you showed, like this:
DataFrame df = sqlContext.read().json("/path/to/file").toDF();
df.registerTempTable("df");
df.printSchema();
Then you can select nested fields inside a struct type like so:
DataFrame apps = df.select("apps");
apps.registerTempTable("apps");
apps.printSchema();
apps.show();
// "element" is only a label in the printSchema output, so address the field as apps.appName
DataFrame appName = apps.select("apps.appName");
appName.registerTempTable("appName");
appName.printSchema();
appName.show();
Try this:
val nameAndAddress = sqlContext.sql("""
SELECT name, address.city, address.state
FROM people
""")
nameAndAddress.collect.foreach(println)
Source:
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
Have you tried doing it straight from the SQL query, like this?
SELECT apps.Ratings FROM yourTableName
This will probably return an array, and you can then access the elements inside more easily.
Also, I use this online Json viewer when I have to deal with large JSON structures and the schema is too complex:
http://jsonviewer.stack.hu/
I am using pyspark, but the logic should be similar.
I found this way of parsing my nested json useful:
df.select(df.apps.appName.alias("apps_Name"), \
df.apps.appPackage.alias("apps_Package"), \
df.apps.Ratings.date.alias("apps_Ratings_date")) \
.show()
The code could obviously be shortened with an f-string.
val df = spark.read.format("json").load("/path/to/file")
df.createOrReplaceTempView("df")
// Explode apps first so that the filter applies to one app at a time
spark.sql("select app.Ratings from df lateral view explode(apps) exploded as app where app.appName like '%app_name%'").show()