Reading an element from a json object stored in a column - json

I have the following dataframe
+---+---------------------------+
|key|value                      |
+---+---------------------------+
|1  |{"name":"John", "age": 34} |
|2  |{"name":"Rose", "age": 50} |
+---+---------------------------+
I want to retrieve all the age values from this dataframe and store them in an array later.
val x = df_clean.withColumn("value", col("value.age"))
x.show(false)
But this throws an exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Can't extract value from value#89: need struct type but got string;
How can I resolve this?
EDIT
val schema = existingSparkSession.read.json(df_clean.select("value").as[String]).schema
val my_json = df_clean.select(from_json(col("value"), schema).alias("jsonValue"))
my_json.printSchema()
val df_final = my_json.withColumn("age", col("jsonValue.age"))
df_final.show(false)
Currently no exceptions are thrown, yet I can't see any output either.
EDIT 2
println("---+++++--------")
df_clean.select("value").take(1)
println("---+++++--------")
output
---+++++--------
---+++++--------

If you have a long JSON string and don't want to define the schema by hand, you can infer it from the data and then use from_json with that schema:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "{\"name\":\"John\", \"age\": 34}"),
  (2, "{\"name\":\"Rose\", \"age\": 50}")
).toDF("key", "value")

val schema = spark.read.json(df.select("value").as[String]).schema
val resultDF = df.withColumn("value", from_json($"value", schema))
resultDF.show(false)
resultDF.printSchema()
Output:
+---+----------+
|key|value |
+---+----------+
|1 |{34, John}|
|2 |{50, Rose}|
+---+----------+
Schema:
root
|-- key: integer (nullable = false)
|-- value: struct (nullable = true)
|    |-- age: long (nullable = true)
|    |-- name: string (nullable = true)
If you need to access the nested fields directly, you can use get_json_object:
df.withColumn("name", get_json_object($"value", "$.name"))
.withColumn("age", get_json_object($"value", "$.age"))
.show(false)
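If the end goal from the question is to collect all the age values into a Scala array on the driver, a possible follow-up sketch using the resultDF built above (collect is only reasonable here because the result is small):
// Select the nested age field from the parsed struct and collect it as an Array[Long].
val ages: Array[Long] = resultDF.select($"value.age").as[Long].collect()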

Related

Extract array from list of json strings using Spark

I have a column in my data frame which contains a list of JSONs, but the type is String. I need to run explode on this column, so first I need to convert it into a list. I couldn't find many references for this use case.
Sample data:
columnName: "[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}, {...}]"
The above is how the data looks; the fields are not fixed (index 0 might have a JSON with some fields while index 1 has other fields). The list can also contain further nested JSONs or extra fields. I am currently using this:
"""explode(split(regexp_replace(regexp_replace(colName, '(\\\},)','}},'), '(\\\[|\\\])',''), "},")) as colName""" where I am just replacing "}," with "}}," then removing "[]" and then calling split on "}," but this approach doesn't work since there are nested JSONs.
How can I extract the array from the string?
You can try it this way:
// Initial DataFrame
df.show(false)
+----------------------------------------------------------------------+
|columnName |
+----------------------------------------------------------------------+
|[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]|
+----------------------------------------------------------------------+
df.printSchema()
root
|-- columnName: string (nullable = true)
// toArray is a user defined function that parses an array of json objects which is present as a string
import org.json.JSONArray
import org.apache.spark.sql.functions._

val toArray = udf { (data: String) =>
  val jsonArray = new JSONArray(data)
  var arr: Array[String] = Array()
  val objects = (0 until jsonArray.length).map(x => jsonArray.getJSONObject(x))
  objects.foreach { elem =>
    arr :+= elem.toString
  }
  arr
}
// Using the udf and exploding the resultant array
val df1 = df.withColumn("columnName",explode(toArray(col("columnName"))))
df1.show(false)
+-----------------------------------------------------+
|columnName |
+-----------------------------------------------------+
|{"other":7,"name":"a","info":{"grade":"b","age":"1"}}|
|{"random":"x"} |
+-----------------------------------------------------+
df1.printSchema()
root
|-- columnName: string (nullable = true)
// Parsing the json string by obtaining the schema dynamically
val schema = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).schema
val df2 = df1.withColumn("columnName",from_json(col("columnName"),schema))
df2.show(false)
+---------------+
|columnName |
+---------------+
|[[1, b], a, 7,]|
|[,,, x] |
+---------------+
df2.printSchema()
root
|-- columnName: struct (nullable = true)
|    |-- info: struct (nullable = true)
|    |    |-- age: string (nullable = true)
|    |    |-- grade: string (nullable = true)
|    |-- name: string (nullable = true)
|    |-- other: long (nullable = true)
|    |-- random: string (nullable = true)
// Extracting all the fields from the json
df2.select(col("columnName.*")).show(false)
+------+----+-----+------+
|info |name|other|random|
+------+----+-----+------+
|[1, b]|a |7 |null |
|null |null|null |x |
+------+----+-----+------+
Edit:
You can try this way if you can use the get_json_object function:
// Get the list of columns dynamically
import org.apache.spark.sql.Column

val columns = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).columns

// Define an empty array of Column and use get_json_object to extract each field
var extract_columns: Array[Column] = Array()
columns.foreach { column =>
  extract_columns :+= get_json_object(col("columnName"), "$." + column).as(column)
}
df1.select(extract_columns: _*).show(false)
+-----------------------+----+-----+------+
|info |name|other|random|
+-----------------------+----+-----+------+
|{"grade":"b","age":"1"}|a |7 |null |
|null |null|null |x |
+-----------------------+----+-----+------+
Please note that the info column is not of struct type here. You may have to follow a similar approach to extract the columns of the nested JSON, as sketched below.
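A possible sketch (field names are taken from the sample data above, and the JSON paths are relative to the root object):
// Pull the nested fields of the string-typed info column with get_json_object,
// in the same style as the dynamic extraction above.
df1.select(
  get_json_object(col("columnName"), "$.info.age").as("age"),
  get_json_object(col("columnName"), "$.info.grade").as("grade"),
  get_json_object(col("columnName"), "$.name").as("name")
).show(false)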
val testString = """[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]"""
val ds = Seq(testString).toDS()
spark.read.json(ds)
.select("info.age", "info.grade","name","other","random")
.show(10,false)
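If your Spark version supports parsing arrays with from_json (2.4+ does), a possible sketch that avoids the UDF entirely, reusing the names df and df1 from above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.ArrayType
import spark.implicits._

// Infer the element schema from the already-exploded strings, then parse the original
// string column as an array of structs and explode it in one pass.
val elementSchema = spark.read.json(df1.select("columnName").as[String]).schema
val exploded = df.withColumn("columnName",
  explode(from_json(col("columnName"), ArrayType(elementSchema))))
exploded.select(col("columnName.*")).show(false)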

How to convert a DataFrame where all Columns are Strings into a DataFrame with a specific Schema

Imagine the following input:
val data = Seq (("1::Alice"), ("2::Bob"))
val dfInput = data.toDF("input")
val dfTwoColTypeString = dfInput.map(row => row.getString(0).split("::")).map{ case Array(id, name) => (id, name) }.toDF("id", "name")
Now I have a DataFrame with the columns I wanted:
scala> dfTwoColTypeString.show
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
+---+-----+
Of course I would like to have the column id of type int, but it is of type String:
scala> dfTwoColTypeString.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
Therefore I define this schema:
val mySchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true)
))
What is the best way to cast or convert the DataFrame dfTwoColTypeString to the given target schema?
Bonus: If the given input cannot be cast or converted to the target schema, I would love to get a null row with an extra column "bad_record" containing the bad input data. That is, I want to accomplish the same as the CSV parser in PERMISSIVE mode.
Any help really appreciated.
If the conversion is needed after the data has already been read, code like this can be used:
val resultDF = mySchema.fields.foldLeft(dfTwoColTypeString)((df, c) => df.withColumn(c.name, col(c.name).cast(c.dataType)))
resultDF.printSchema()
Output:
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
To check that the values match the target types, code like this can be used:
val dfTwoColTypeString = dfInput
  .map(row => row.getString(0).split("::"))
  .map {
    case Array(id, name) =>
      if (ConvertUtils.canBeCasted((id, name), mySchema))
        (id, name, null)
      else
        (null, null, id + "::" + name)
  }
  .toDF("id", "name", "malformed")
Two new static functions can be created in a custom class (here ConvertUtils):
import scala.util.control.Exception.allCatch

def canBeCasted(values: Product, mySchema: StructType): Boolean = {
  mySchema.fields.zipWithIndex.forall(v =>
    canBeCasted(values.productElement(v._2).asInstanceOf[String], v._1.dataType))
}

def canBeCasted(value: String, dtype: DataType): Boolean = dtype match {
  case StringType  => true
  case IntegerType => (allCatch opt value.toInt).isDefined
  // TODO add other types here
  case _           => false
}
Output with wrong "cc::Bob" value:
+----+-----+---------+
|id |name |malformed|
+----+-----+---------+
|1 |Alice|null |
|null|null |cc::Bob |
+----+-----+---------+
If CSV reading is required and the schema is known, it can be assigned during reading:
spark.read.schema(mySchema).csv("filename.csv")
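For the bonus requirement (a bad_record column like the CSV parser's PERMISSIVE mode), a possible sketch; note that the corrupt-record column has to be declared in the schema as a nullable string:
import org.apache.spark.sql.types._

// Malformed lines end up in "bad_record" instead of failing the read.
val permissiveSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("bad_record", StringType, true)
))

val dfPermissive = spark.read
  .schema(permissiveSchema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "bad_record")
  .csv("filename.csv")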
val cols = Array(col("id").cast(IntegerType),col("name"))
dfTwoColTypeString.select(cols:_*).printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
//Another approach
import org.apache.spark.sql.types.{StringType,IntegerType,StructType,StructField}
val mySchema = StructType(Array(StructField("id", IntegerType, true),StructField("name", StringType, true)))
val df = spark.createDataFrame(dfTwoColTypeString.rdd,mySchema)
df.printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
Considering dfTwoColTypeString to be a dataframe, you can also cast its columns as below.
dfTwoColTypeString.withColumn("id", col("id").cast("Int"))

Spark: correct schema to load JSON as DataFrame

I have a JSON like
{ 1234 : "blah1", 9807: "blah2", 467: "blah_k", ...}
written to a gzipped file. It is a mapping of one ID space to another where the keys are ints and values are strings.
I want to load it as a DataFrame in Spark.
I loaded it as,
val df = spark.read.format("json").load("my_id_file.json.gz")
By default, Spark loaded it with a schema that looks like
|-- 1234: string (nullable = true)
|-- 9807: string (nullable = true)
|-- 467: string (nullable = true)
Instead, I want my DataFrame to look like:
+----+------+
|id1 |id2 |
+----+------+
|1234|blah1 |
|9807|blah2 |
|467 |blah_k|
+----+------+
So, I tried the following.
import org.apache.spark.sql.types._
val idMapSchema = StructType(Array(StructField("id1", IntegerType, true), StructField("id2", StringType, true)))
val df = spark.read.format("json").schema(idMapSchema).load("my_id_file.json.gz")
However, the loaded data frame looks like
scala> df.show
+----+----+
|id1 |id2 |
+----+----+
|null|null|
+----+----+
How can I specify the schema to fix this? Is there a "pure" dataframe approach (without creating an RDD and then creating DataFrame)?
One way to achieve this is to read the input file with textFile, apply your parsing logic within map(), and then convert the result to a dataframe:
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer
import org.json.JSONObject

val rdd = sparkSession.sparkContext.textFile("my_input_file_path")
  .map(row => {
    val list = new ListBuffer[String]()
    val inputJson = new JSONObject(row)
    for (key <- inputJson.keySet()) {
      val resultJson = new JSONObject()
      resultJson.put("col1", key)
      resultJson.put("col2", inputJson.get(key))
      list += resultJson.toString()
    }
    list
  })
  .flatMap(row => row)
val df = sparkSession.read.json(rdd)
df.printSchema()
df.show(false)
output:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
+----+------+
|col1|col2 |
+----+------+
|1234|blah1 |
|467 |blah_k|
|9807|blah2 |
+----+------+
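If you want to avoid the RDD detour entirely, a possible pure-DataFrame sketch (assuming Spark 2.4+; the allowUnquotedFieldNames option is needed because the keys in the sample are not quoted):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Read each line as plain text, parse it into a string -> string map,
// then explode the map into key/value rows and cast the key to an integer.
val raw = spark.read.text("my_id_file.json.gz")
val result = raw
  .select(explode(from_json(col("value"), MapType(StringType, StringType),
    Map("allowUnquotedFieldNames" -> "true"))).as(Seq("id1", "id2")))
  .withColumn("id1", col("id1").cast(IntegerType))
result.show(false)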

Fit a json string to a DataFrame using a schema

I have a schema that looks like this:
StructType(StructField(keys,org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7,true))
I have a JSON string (that matches this schema) that I need to convert to fit the above schema.
"{"keys" : [2.0, 1.0]}"
How do I proceed to get a DataFrame out of this string that matches my schema?
Following are the steps I have tried in a scala notebook:
val rddData2 = sc.parallelize("""{"keys" : [1.0 , 2.0] }""" :: Nil)
val in = session.read.schema(schema).json(rddData2)
in.show
This is the output being shown:
+-----------+
|keys |
+-----------+
|null |
+-----------+
If you have a json string as
val jsonString = """{"keys" : [2.0, 1.0]}"""
then you can create a dataframe without schema as
val jsonRdd = sc.parallelize(Seq(jsonString))
val df = sqlContext.read.json(jsonRdd)
which should give you
+----------+
|keys |
+----------+
|[2.0, 1.0]|
+----------+
with schema
root
|-- keys: array (nullable = true)
|    |-- element: double (containsNull = true)
Now, if you want to convert the array column created by default into a Vector, you would need a udf function such as:
import org.apache.spark.sql.functions._
def vectorUdf = udf((array: collection.mutable.WrappedArray[Double]) => org.apache.spark.ml.linalg.Vectors.dense(Array(array: _*)))
and call the udf function using .withColumn as
df.withColumn("keys", vectorUdf(col("keys")))
You should get a dataframe with the following schema:
root
|-- keys: vector (nullable = true)
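For completeness, a possible end-to-end run under the same assumptions (the vectorUdf defined above, with sc and sqlContext as in the question):
// Parse the JSON string without a schema, then convert the inferred array column to a Vector.
val jsonString = """{"keys" : [2.0, 1.0]}"""
val jsonRdd = sc.parallelize(Seq(jsonString))
val parsed = sqlContext.read.json(jsonRdd)
val withVector = parsed.withColumn("keys", vectorUdf(col("keys")))
withVector.printSchema()
withVector.show(false)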

Convert spark Dataframe with schema to dataframe of json String

I have a Dataframe like this:
+--+--------+--------+----+-------------+------------------------------+
|id|name |lastname|age |timestamp |creditcards |
+--+--------+--------+----+-------------+------------------------------+
|1 |michel |blanc |35 |1496756626921|[[hr6,3569823], [ee3,1547869]]|
|2 |peter |barns |25 |1496756626551|[[ye8,4569872], [qe5,3485762]]|
+--+--------+--------+----+-------------+------------------------------+
where the schema of my df is like below:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- lastname: string (nullable = true)
|-- age: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- creditcards: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- id: string (nullable = true)
|    |    |-- number: string (nullable = true)
I would like to convert each row to a JSON string, given my schema, so that the resulting dataframe has a single string column containing the JSON.
The first line should be like this:
{
  "id":"1",
  "name":"michel",
  "lastname":"blanc",
  "age":"35",
  "timestamp":"1496756626921",
  "creditcards":[
    {
      "id":"hr6",
      "number":"3569823"
    },
    {
      "id":"ee3",
      "number":"1547869"
    }
  ]
}
and the second line of the dataframe should be like this:
{
  "id":"2",
  "name":"peter",
  "lastname":"barns",
  "age":"25",
  "timestamp":"1496756626551",
  "creditcards":[
    {
      "id":"ye8",
      "number":"4569872"
    },
    {
      "id":"qe5",
      "number":"3485762"
    }
  ]
}
My goal is not to write the dataframe to a JSON file. My goal is to convert df1 into a second dataframe df2 in order to push each JSON line of df2 to a Kafka topic.
I have this code to create the dataframe:
val line1 = """{"id":"1","name":"michel","lastname":"blanc","age":"35","timestamp":"1496756626921","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}]}"""
val line2 = """{"id":"2","name":"peter","lastname":"barns","age":"25","timestamp":"1496756626551","creditcards":[{"id":"ye8","number":"4569872"}, {"id":"qe5","number":"3485762"}]}"""
val rdd = sc.parallelize(Seq(line1, line2))
val df = sqlContext.read.json(rdd)
df show false
df printSchema
Do you have any idea?
If all you need is a single-column DataFrame/Dataset with each column value representing each row of the original DataFrame in JSON, you can simply apply toJSON to your DataFrame, as in the following:
df.show
// +---+------------------------------+---+--------+------+-------------+
// |age|creditcards |id |lastname|name |timestamp |
// +---+------------------------------+---+--------+------+-------------+
// |35 |[[hr6,3569823], [ee3,1547869]]|1 |blanc |michel|1496756626921|
// |25 |[[ye8,4569872], [qe5,3485762]]|2 |barns |peter |1496756626551|
// +---+------------------------------+---+--------+------+-------------+
val dsJson = df.toJSON
// dsJson: org.apache.spark.sql.Dataset[String] = [value: string]
dsJson.show
// +--------------------------------------------------------------------------+
// |value |
// +--------------------------------------------------------------------------+
// |{"age":"35","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3",...|
// |{"age":"25","creditcards":[{"id":"ye8","number":"4569872"},{"id":"qe5",...|
// +--------------------------------------------------------------------------+
[UPDATE]
To add name as an additional column, you can extract it from the JSON column using from_json:
val result = dsJson.withColumn("name", from_json($"value", df.schema)("name"))
result.show
// +--------------------+------+
// | value| name|
// +--------------------+------+
// |{"age":"35","cred...|michel|
// |{"age":"25","cred...| peter|
// +--------------------+------+
For that, you can directly convert your dataframe to a Dataset of JSON strings using
val jsonDataset: Dataset[String] = df.toJSON
You can convert it into a dataframe using
val jsonDF: DataFrame = jsonDataset.toDF
Here the JSON keys will be alphabetically ordered, so the output of
jsonDF show false
will be
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"age":"35","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}],"id":"1","lastname":"blanc","name":"michel","timestamp":"1496756626921"}|
|{"age":"25","creditcards":[{"id":"ye8","number":"4569872"},{"id":"qe5","number":"3485762"}],"id":"2","lastname":"barns","name":"peter","timestamp":"1496756626551"} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
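Since the stated goal is to push each JSON line to a Kafka topic, a possible sketch of the batch Kafka sink (requires the spark-sql-kafka-0-10 package; the broker address and topic name below are placeholders):
// jsonDF already has a single string column named "value",
// which is exactly what the Kafka sink expects.
jsonDF
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "my_topic")
  .save()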