Extract array from list of JSON strings using Spark

I have a column in my DataFrame that contains a list of JSON objects, but its type is String. I need to run explode on this column, so I first need to convert it into an array. I couldn't find many references for this use case.
Sample data:
columnName: "[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}, {...}]"
This is what the data looks like. The fields are not fixed (the object at index 0 might have some fields while the object at index 1 has others), and the list can contain nested JSON objects or extra fields. I am currently using this:
"""explode(split(regexp_replace(regexp_replace(colName, '(\\\},)','}},'), '(\\\[|\\\])',''), "},")) as colName"""
Here I replace "}," with "}},", remove "[" and "]", and then split on "},". This approach doesn't work because there are nested JSON objects.
How can I extract the array from the string?

You can try this way:
// Initial DataFrame
df.show(false)
+----------------------------------------------------------------------+
|columnName |
+----------------------------------------------------------------------+
|[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]|
+----------------------------------------------------------------------+
df.printSchema()
root
|-- columnName: string (nullable = true)
// toArray is a user-defined function that parses a string holding a JSON array of objects
// (the org.json library must be on the classpath)
import org.apache.spark.sql.functions.{col, explode, from_json, udf}
import org.json.JSONArray
val toArray = udf { (data: String) =>
val jsonArray = new JSONArray(data)
// Serialize each element of the array back to its own JSON string
(0 until jsonArray.length).map(i => jsonArray.getJSONObject(i).toString).toArray
}
// Using the udf and exploding the resultant array
val df1 = df.withColumn("columnName",explode(toArray(col("columnName"))))
df1.show(false)
+-----------------------------------------------------+
|columnName |
+-----------------------------------------------------+
|{"other":7,"name":"a","info":{"grade":"b","age":"1"}}|
|{"random":"x"} |
+-----------------------------------------------------+
df1.printSchema()
root
|-- columnName: string (nullable = true)
// Parsing the json string by obtaining the schema dynamically
val schema = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).schema
val df2 = df1.withColumn("columnName",from_json(col("columnName"),schema))
df2.show(false)
+---------------+
|columnName |
+---------------+
|[[1, b], a, 7,]|
|[,,, x] |
+---------------+
df2.printSchema()
root
|-- columnName: struct (nullable = true)
| |-- info: struct (nullable = true)
| | |-- age: string (nullable = true)
| | |-- grade: string (nullable = true)
| |-- name: string (nullable = true)
| |-- other: long (nullable = true)
| |-- random: string (nullable = true)
// Extracting all the fields from the json
df2.select(col("columnName.*")).show(false)
+------+----+-----+------+
|info |name|other|random|
+------+----+-----+------+
|[1, b]|a |7 |null |
|null |null|null |x |
+------+----+-----+------+
Edit:
Alternatively, if you can use the get_json_object function, you can try this:
// Get the list of columns dynamically
val columns = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).columns
// Build one get_json_object expression per top-level field
import org.apache.spark.sql.functions.get_json_object
val extractColumns = columns.map { column =>
get_json_object(col("columnName"), "$." + column).as(column)
}
df1.select(extractColumns: _*).show(false)
+-----------------------+----+-----+------+
|info |name|other|random|
+-----------------------+----+-----+------+
|{"grade":"b","age":"1"}|a |7 |null |
|null |null|null |x |
+-----------------------+----+-----+------+
Please note that the info column is not of struct type here; it is still a JSON string. You may have to extract the fields of the nested JSON in a similar way, for example with the path $.info.age.

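// Alternatively, let Spark's JSON reader parse the string directly;
// a top-level JSON array yields one row per element
import spark.implicits._ // needed for toDS()
val testString = """[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]"""
val ds = Seq(testString).toDS()
spark.read.json(ds)
.select("info.age", "info.grade","name","other","random")
.show(10,false)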
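
The question notes that splitting on "}," breaks as soon as objects are nested. As an illustration of why, here is a minimal sketch in plain Scala of a split that only cuts at commas sitting directly inside the outer array. The helper name splitTopLevel is hypothetical and not part of Spark; for real data a proper JSON parser (as in the UDF answer above) is the more robust choice, since this sketch assumes well-formed input.

```scala
// Hypothetical helper (not a Spark API): split a JSON array string into its
// top-level elements by tracking nesting depth and string literals.
def splitTopLevel(json: String): Vector[String] = {
  val out = Vector.newBuilder[String]
  val cur = new StringBuilder
  var depth = 0         // 1 means "directly inside the outer array"
  var inString = false  // true while inside a "..." literal
  var escaped = false   // true right after a backslash inside a literal
  for (c <- json.trim) {
    if (inString) {
      cur.append(c)
      if (escaped) escaped = false
      else if (c == '\\') escaped = true
      else if (c == '"') inString = false
    } else c match {
      case '"' =>
        inString = true
        cur.append(c)
      case '[' | '{' =>
        if (depth > 0) cur.append(c) // drop only the outermost '['
        depth += 1
      case ']' | '}' =>
        depth -= 1
        if (depth > 0) cur.append(c) // drop only the outermost ']'
      case ',' if depth == 1 =>
        // A comma at depth 1 separates two elements of the outer array
        out += cur.toString.trim
        cur.clear()
      case _ =>
        if (depth > 0) cur.append(c)
    }
  }
  val last = cur.toString.trim
  if (last.nonEmpty) out += last
  out.result()
}
```

Commas inside nested objects or inside string values are left alone because they are never seen at depth 1 outside a literal. Wrapped in a udf, such a function could feed explode in the same way as toArray above.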

Related

Reading an element from a json object stored in a column

I have the following dataframe
+-------+--------------------------------
|__key__|______value____________________|
| 1 | {"name":"John", "age": 34} |
| 2 | {"name":"Rose", "age": 50} |
I want to retrieve all age values from this dataframe and later store them in an array.
val x = df_clean.withColumn("value", col("value.age"))
x.show(false)
But this throws an exception.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Can't extract value from value#89: need struct type but got string;
How can I resolve this?
EDIT
val schema = existingSparkSession.read.json(df_clean.select("value").as[String]).schema
val my_json = df_clean.select(from_json(col("value"), schema).alias("jsonValue"))
my_json.printSchema()
val df_final = my_json.withColumn("age", col("jsonValue.age"))
df_final.show(false)
No exception is thrown now, but I don't see any output either.
EDIT 2
println("---+++++--------")
df_clean.select("value").take(1)
println("---+++++--------")
output
---+++++--------
---+++++--------
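If the JSON is long and you don't want to write the schema by hand, you can infer the schema from the data and pass it to from_json.
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF, as[String] and the $ column syntax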
val df = Seq(
(1, "{\"name\":\"John\", \"age\": 34}"),
(2, "{\"name\":\"Rose\", \"age\": 50}")
).toDF("key", "value")
val schema = spark.read.json(df.select("value").as[String]).schema
val resultDF = df.withColumn("value", from_json($"value", schema))
resultDF.show(false)
resultDF.printSchema()
Output:
+---+----------+
|key|value |
+---+----------+
|1 |{34, John}|
|2 |{50, Rose}|
+---+----------+
Schema:
root
|-- key: integer (nullable = false)
|-- value: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
If you only need direct access to individual fields, you can use get_json_object (it returns every value as a string):
df.withColumn("name", get_json_object($"value", "$.name"))
.withColumn("age", get_json_object($"value", "$.age"))
.show(false)

Parse JSON root in a column using Spark-Scala

I'm having trouble transforming the root keys of a JSON document into records of a data frame, for an undetermined number of keys.
I have a data frame generated from JSON similar to the following:
val exampleJson = spark.createDataset(
"""
{"ITEM1512":
{"name":"Yin",
"address":{"city":"Columbus",
"state":"Ohio"}
},
"ITEM1518":
{"name":"Yang",
"address":{"city":"Working",
"state":"Marc"}
}
}""" :: Nil)
When I read it with the following instruction
val itemsExample = spark.read.json(exampleJson)
The Schema and Data Frame generated is the following:
+-----------------------+-----------------------+
|ITEM1512 |ITEM1518 |
+-----------------------+-----------------------+
|[[Columbus, Ohio], Yin]|[[Working, Marc], Yang]|
+-----------------------+-----------------------+
root
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
But I want to generate something like this:
+-----------------------+-----------------------+
|Item |Values |
+-----------------------+-----------------------+
|ITEM1512 |[[Columbus, Ohio], Yin]|
|ITEM1518 |[[Working, Marc], Yang]|
+-----------------------+-----------------------+
So, to parse this JSON data I need to read every column and add it as a record to the data frame, because there are many more items than the two I wrote as an example. In fact, there are millions of items that I'd like to add to the data frame.
I'm trying to replicate the solution found here in: How to parse the JSON data using Spark-Scala
with this code:
val columns:Array[String] = itemsExample.columns
var arrayOfDFs:Array[DataFrame] = Array()
for(col_name <- columns){
val temp = itemsExample.selectExpr("explode("+col_name+") as element")
.select(
lit(col_name).as("Item"),
col("element.E").as("Value"))
arrayOfDFs = arrayOfDFs :+ temp
}
val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)
But the problem is that, while the root in the other question's example is an array, in my case the root is a StructType. Therefore the following exception is thrown:
org.apache.spark.sql.AnalysisException: cannot resolve
'explode(ITEM1512)' due to data type mismatch: input to function
explode should be array or map type, not
struct<address:struct<city:string,state:string>,name:string>
You can use the stack function.
Example:
itemsExample.selectExpr("""stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)""").
show(false)
//+--------+-----------------------+
//|Item |Values |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
UPDATE:
Dynamic stack query (built from the columns of itemsExample):
val stack=itemsExample.columns.map(x => s"'${x}',${x}").mkString(s"stack(${itemsExample.columns.size},",",",") as (Item,Values)")
//stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)
itemsExample.selectExpr(stack).show()
//+--------+-----------------------+
//|Item |Values |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
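
The dynamic query above pastes column names into the SQL string unquoted, which can fail for names containing spaces, dots or other special characters. A minimal sketch of a small builder that backtick-quotes each column reference is shown here; the helper name stackExpr is hypothetical, and it assumes the names themselves contain no single quotes or backticks. Note also that stack expects all stacked value columns to share the same type, which holds in the example because every item has the same struct layout.

```scala
// Hypothetical helper: build a stack(...) expression from column names,
// using the name as a string literal for Item and as a quoted column for Values.
def stackExpr(columns: Seq[String]): String = {
  require(columns.nonEmpty, "stack needs at least one column")
  // 'name' is the literal label, `name` is the column reference
  val pairs = columns.map(c => s"'$c', `$c`")
  pairs.mkString(s"stack(${columns.size}, ", ", ", ") as (Item, Values)")
}
```

It could then be used as itemsExample.selectExpr(stackExpr(itemsExample.columns)).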

How to convert a DataFrame where all Columns are Strings into a DataFrame with a specific Schema

Imagine the following input:
val data = Seq (("1::Alice"), ("2::Bob"))
val dfInput = data.toDF("input")
val dfTwoColTypeString = dfInput.map(row => row.getString(0).split("::")).map{ case Array(id, name) => (id, name) }.toDF("id", "name")
Now I have a DataFrame with the columns as wished:
scala> dfTwoColTypeString.show
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
+---+-----+
Of course I would like to have the column id of type int, but it is of type String:
scala> dfTwoColTypeString.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
Therefore I define this schema:
val mySchema = StructType(Array(
StructField("id", IntegerType, true),
StructField("name", StringType, true)
))
What is the best way to cast or convert the DataFrame dfTwoColTypeString to the given target schema?
Bonus: If the given input cannot be cast or converted to the target schema I would love to get a null row with an extra column "bad_record" containing the bad input data. That is, I want to accomplish the same, as the CSV parser in PERMISSIVE mode.
Any help really appreciated.
If the conversion is needed after the data has been read, code like this can be used:
val resultDF = mySchema.fields.foldLeft(dfTwoColTypeString)((df, c) => df.withColumn(c.name, col(c.name).cast(c.dataType)))
resultDF.printSchema()
Output:
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
To check whether the values match the target types, code like this can be used:
val dfTwoColTypeString = dfInput.map(
row =>
row.getString(0).split("::"))
.map {
case Array(id, name) =>
if (ConvertUtils.canBeCasted((id, name), mySchema))
(id, name, null)
else (null, null, id + "::" + name)}
.toDF("id", "name", "malformed")
Two helper functions can be defined in a custom object (here ConvertUtils):
def canBeCasted(values: Product, mySchema: StructType): Boolean = {
mySchema.fields.zipWithIndex.forall(v => canBeCasted(values.productElement(v._2).asInstanceOf[String], v._1.dataType))
}
import scala.util.control.Exception.allCatch
def canBeCasted(value: String, dtype: DataType): Boolean = dtype match {
case StringType => true
case IntegerType => (allCatch opt value.toInt).isDefined
// TODO add other types here
case _ => false
}
Output with wrong "cc::Bob" value:
+----+-----+---------+
|id |name |malformed|
+----+-----+---------+
|1 |Alice|null |
|null|null |cc::Bob |
+----+-----+---------+
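
The same PERMISSIVE-style idea can also be sketched as a single pure function that returns either the typed values or the raw bad input. This is only an illustration under assumptions: the case class Parsed and the function parseLine are hypothetical names, and only the Int conversion for id is checked.

```scala
import scala.util.Try

// Hypothetical record type mirroring the id / name / malformed columns.
final case class Parsed(id: Option[Int], name: Option[String], malformed: Option[String])

// Parse one "id::name" line; anything that does not fit goes to `malformed`.
def parseLine(line: String): Parsed =
  line.split("::", -1) match {
    case Array(id, name) if Try(id.trim.toInt).isSuccess =>
      Parsed(Some(id.trim.toInt), Some(name), None)
    case _ =>
      Parsed(None, None, Some(line))
  }
```

With spark.implicits._ in scope, something like dfInput.as[String].map(parseLine).toDF() should give nullable id, name and malformed columns, with id already typed as an integer.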
If the data is read from CSV and the schema is known, the schema can be assigned during reading:
spark.read.schema(mySchema).csv("filename.csv")
val cols = Array(col("id").cast(IntegerType),col("name"))
dfTwoColTypeString.select(cols:_*).printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
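//Another approach: rebuild the DataFrame with the target schema.
//createDataFrame does not convert values, so the id strings are converted explicitly;
//otherwise the schema would say integer while the rows still hold strings, and any action would fail.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType,IntegerType,StructType,StructField}
val mySchema = StructType(Array(StructField("id", IntegerType, true),StructField("name", StringType, true)))
val converted = dfTwoColTypeString.rdd.map(r => Row(r.getString(0).toInt, r.getString(1)))
val df = spark.createDataFrame(converted,mySchema)
df.printSchema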
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
Considering dfTwoColTypeString to be a dataframe, you can also convert its schema type as below.
dfTwoColTypeString.withColumn("id", col("id").cast("Int"))

How to convert WrappedArray to string in spark?

I have a JSON file which contains a nested array like the one below:
| | |-- coordinates: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: array (containsNull = true)
| | | | | |-- element: array (containsNull = true)
| | | | | | |-- element: long (containsNull = true)
I have used Spark to read json and exploded the array.
explode(col("list_of_features.geometry.coordinates"))
which returns values as below,
WrappedArray(WrappedArray(WrappedArray(1271700, 6404100), WrappedArray(1271700, 6404200), WrappedArray(1271600, 6404200), WrappedArray(1271600, 6404300),....
But the original input has no WrappedArray.
It looks something like:
[[[[1271700,6404100],[1271700, 6404200],[1271600, 6404200]
The ultimate aim is to store the coordinates without WrappedArray (maybe as a String) in a CSV file for Hive to read.
After explode, is there any way to get just the coordinates enclosed in proper square brackets?
Or can I use replace to remove the WrappedArray string value in the RDD?
You can use a UDF to flatten the WrappedArray and turn it into a String value:
//udf
val concatArray = udf((value: Seq[Seq[Seq[Seq[Long]]]]) => {
value.flatten.flatten.flatten.mkString(",")
})
Now use the udf to create or replace the column:
df1.withColumn("coordinates", concatArray($"coordinates") )
This should give you a single string of values separated by ",", with no WrappedArray.
UPDATE: If you want the same bracketed format as a string, you can do:
val concatArray = udf((value: Seq[Seq[Seq[Seq[Long]]]]) => {
// Format every nesting level with brackets and commas, including the outermost one, so the result is a single String
value.map(_.map(_.map(_.mkString("[", ",", "]")).mkString("[", ",", "]")).mkString("[", ",", "]")).mkString("[", ",", "]")
})
Output:
[[[[1271700,6404100],[1271700,6404200],[1271600,6404200]]]]
Hope this helps!
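
If the nesting depth is not fixed, the bracket formatting can be written once as a recursive function instead of one mkString per level. The following is a minimal sketch in plain Scala; render is a hypothetical name, and it assumes the leaves are numbers (strings would not be quoted).

```scala
// Hypothetical helper: render arbitrarily nested sequences as a bracketed string.
def render(value: Any): String = value match {
  case null                       => "null"
  case s: scala.collection.Seq[_] => s.map(render).mkString("[", ",", "]")
  case other                      => other.toString
}
```

Inside Spark it would still need to be wrapped in a udf with a concrete input type, for example udf((v: Seq[Seq[Seq[Long]]]) => render(v)), since a udf cannot take Any as its parameter type.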

Convert spark Dataframe with schema to dataframe of json String

I have a Dataframe like this:
+--+--------+--------+----+-------------+------------------------------+
|id|name |lastname|age |timestamp |creditcards |
+--+--------+--------+----+-------------+------------------------------+
|1 |michel |blanc |35 |1496756626921|[[hr6,3569823], [ee3,1547869]]|
|2 |peter |barns |25 |1496756626551|[[ye8,4569872], [qe5,3485762]]|
+--+--------+--------+----+-------------+------------------------------+
where the schema of my df is like below:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- lastname: string (nullable = true)
|-- age: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- creditcards: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- number: string (nullable = true)
I would like to convert each row to a JSON string according to my schema, so that the resulting dataframe has a single string column containing the JSON.
The first row should look like this:
{
"id":"1",
"name":"michel",
"lastname":"blanc",
"age":"35",
"timestamp":"1496756626921",
"creditcards":[
{
"id":"hr6",
"number":"3569823"
},
{
"id":"ee3",
"number":"1547869"
}
]
}
and the second row of the dataframe should look like this:
{
"id":"2",
"name":"peter",
"lastname":"barns",
"age":"25",
"timestamp":"1496756626551",
"creditcards":[
{
"id":"ye8",
"number":"4569872"
},
{
"id":"qe5",
"number":"3485762"
}
]
}
My goal is not to write the dataframe to a JSON file. My goal is to convert df1 to a second dataframe df2 in order to push each JSON line of df2 to a Kafka topic.
I have this code to create the dataframe:
val line1 = """{"id":"1","name":"michel","lastname":"blanc","age":"35","timestamp":"1496756626921","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}]}"""
val line2 = """{"id":"2","name":"peter","lastname":"barns","age":"25","timestamp":"1496756626551","creditcards":[{"id":"ye8","number":"4569872"}, {"id":"qe5","number":"3485762"}]}"""
val rdd = sc.parallelize(Seq(line1, line2))
val df = sqlContext.read.json(rdd)
df show false
df printSchema
Do you have any idea?
If all you need is a single-column DataFrame/Dataset with each column value representing each row of the original DataFrame in JSON, you can simply apply toJSON to your DataFrame, as in the following:
df.show
// +---+------------------------------+---+--------+------+-------------+
// |age|creditcards |id |lastname|name |timestamp |
// +---+------------------------------+---+--------+------+-------------+
// |35 |[[hr6,3569823], [ee3,1547869]]|1 |blanc |michel|1496756626921|
// |25 |[[ye8,4569872], [qe5,3485762]]|2 |barns |peter |1496756626551|
// +---+------------------------------+---+--------+------+-------------+
val dsJson = df.toJSON
// dsJson: org.apache.spark.sql.Dataset[String] = [value: string]
dsJson.show
// +--------------------------------------------------------------------------+
// |value |
// +--------------------------------------------------------------------------+
// |{"age":"35","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3",...|
// |{"age":"25","creditcards":[{"id":"ye8","number":"4569872"},{"id":"qe5",...|
// +--------------------------------------------------------------------------+
[UPDATE]
To add name as an additional column, you can extract it from the JSON column using from_json:
val result = dsJson.withColumn("name", from_json($"value", df.schema)("name"))
result.show
// +--------------------+------+
// | value| name|
// +--------------------+------+
// |{"age":"35","cred...|michel|
// |{"age":"25","cred...| peter|
// +--------------------+------+
For that, you can directly convert your dataframe to a Dataset of JSON strings using
val jsonDataset: Dataset[String] = df.toJSON
You can convert it into a dataframe using
val jsonDF: DataFrame = jsonDataset.toDF
Here the JSON fields appear in alphabetical order, because the schema inferred by read.json sorts the field names, so the output of
jsonDF show false
will be
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"age":"35","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}],"id":"1","lastname":"blanc","name":"michel","timestamp":"1496756626921"}|
|{"age":"25","creditcards":[{"id":"ye8","number":"4569872"},{"id":"qe5","number":"3485762"}],"id":"2","lastname":"barns","name":"peter","timestamp":"1496756626551"} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+