I have a schema that looks like this:
StructType(StructField(keys,org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7,true))
I have a JSON string (matching this schema) that I need to load into a DataFrame with the above schema.
"{"keys" : [2.0, 1.0]}"
How do I get a DataFrame out of this string that matches my schema?
Following are the steps I have tried in a scala notebook:
val rddData2 = sc.parallelize("""{"keys" : [1.0 , 2.0] }""" :: Nil)
val in = session.read.schema(schema).json(rddData2)
in.show
This is the output being shown:
+-----------+
|keys |
+-----------+
|null |
+-----------+
If you have a json string as
val jsonString = """{"keys" : [2.0, 1.0]}"""
then you can create a dataframe without schema as
val jsonRdd = sc.parallelize(Seq(jsonString))
val df = sqlContext.read.json(jsonRdd)
which should give you
+----------+
|keys |
+----------+
|[2.0, 1.0]|
+----------+
with schema
root
|-- keys: array (nullable = true)
| |-- element: double (containsNull = true)
Now if you want to convert the array column created by default to Vector, then you would need a udf function as
import org.apache.spark.sql.functions._
// Seq[Double] accepts the array column regardless of the concrete Scala collection Spark passes in
val vectorUdf = udf((array: Seq[Double]) => org.apache.spark.ml.linalg.Vectors.dense(array.toArray))
and call the udf function using .withColumn as
df.withColumn("keys", vectorUdf(col("keys")))
You should be getting dataframe with schema as
root
|-- keys: vector (nullable = true)
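A likely reason the schema-based read in the question returns null is that a vector column is stored internally as a struct (type, size, indices, values) rather than as a plain array, so the reader cannot map [1.0, 2.0] onto it. If you control how the JSON is produced, one option is to emit that struct layout directly. The following is only a sketch: the helper name dense_vector_json is hypothetical, and you should verify that your Spark version accepts the resulting record with session.read.schema(schema).

```python
import json

def dense_vector_json(column, values):
    """Build a JSON record whose `column` uses the struct layout of a dense ML vector."""
    # type 1 marks a dense vector; sparse vectors (type 0) also carry size and indices
    return json.dumps({column: {"type": 1, "values": [float(v) for v in values]}})

record = dense_vector_json("keys", [2.0, 1.0])
print(record)  # {"keys": {"type": 1, "values": [2.0, 1.0]}}
```

If you cannot change the input, the udf conversion shown above remains the simpler route.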
I have a column in my DataFrame that contains a list of JSON objects, but its type is String. I need to run explode on this column, so I first need to convert it into a list. I couldn't find many references for this use case.
Sample data:
columnName: "[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}, {...}]"
This is what the data looks like. The fields are not fixed (the JSON at index 0 might have some fields while the JSON at index 1 has others), and the list can contain further nested JSON objects or extra fields. I am currently using this:
"""explode(split(regexp_replace(regexp_replace(colName, '(\\\},)','}},'), '(\\\[|\\\])',''), "},")) as colName"""
Here I replace "}," with "}},", remove "[" and "]", and then split on "},". This approach doesn't work because of the nested JSON objects.
How can I extract the array from the string?
You can try this way:
// Initial DataFrame
df.show(false)
+----------------------------------------------------------------------+
|columnName |
+----------------------------------------------------------------------+
|[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]|
+----------------------------------------------------------------------+
df.printSchema()
root
|-- columnName: string (nullable = true)
// toArray is a user defined function that parses an array of json objects which is present as a string
import org.json.JSONArray
import org.apache.spark.sql.functions._
val toArray = udf { (data: String) =>
  val jsonArray = new JSONArray(data)
  // Serialize each element back to its own JSON string; nested objects stay intact
  (0 until jsonArray.length).map(i => jsonArray.getJSONObject(i).toString).toArray
}
// Using the udf and exploding the resultant array
val df1 = df.withColumn("columnName",explode(toArray(col("columnName"))))
df1.show(false)
+-----------------------------------------------------+
|columnName |
+-----------------------------------------------------+
|{"other":7,"name":"a","info":{"grade":"b","age":"1"}}|
|{"random":"x"} |
+-----------------------------------------------------+
df1.printSchema()
root
|-- columnName: string (nullable = true)
// Parsing the json string by obtaining the schema dynamically
val schema = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).schema
val df2 = df1.withColumn("columnName",from_json(col("columnName"),schema))
df2.show(false)
+---------------+
|columnName |
+---------------+
|[[1, b], a, 7,]|
|[,,, x] |
+---------------+
df2.printSchema()
root
|-- columnName: struct (nullable = true)
| |-- info: struct (nullable = true)
| | |-- age: string (nullable = true)
| | |-- grade: string (nullable = true)
| |-- name: string (nullable = true)
| |-- other: long (nullable = true)
| |-- random: string (nullable = true)
// Extracting all the fields from the json
df2.select(col("columnName.*")).show(false)
+------+----+-----+------+
|info |name|other|random|
+------+----+-----+------+
|[1, b]|a |7 |null |
|null |null|null |x |
+------+----+-----+------+
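The key point of the toArray step above is that a real JSON parser, unlike a regex split on "},", keeps nested objects intact. A minimal plain-Python sketch of that per-row logic, without Spark (the function name split_json_array is hypothetical):

```python
import json

def split_json_array(data):
    """Return one compact JSON string per element of a JSON array given as a string."""
    # Parsing first means nested objects such as "info" are never cut in half
    return [json.dumps(elem, separators=(",", ":")) for elem in json.loads(data)]

raw = '[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]'
for item in split_json_array(raw):
    print(item)
# {"name":"a","info":{"age":"1","grade":"b"},"other":7}
# {"random":"x"}
```

Each returned string corresponds to one row after explode.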
Edit:
You can try this way if you can use get_json_object function
// Get the list of columns dynamically
val columns = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).columns
// Build one get_json_object extraction per top-level key
val extract_columns = columns.map(column => get_json_object(col("columnName"), "$." + column).as(column))
df1.select(extract_columns: _*).show(false)
+-----------------------+----+-----+------+
|info |name|other|random|
+-----------------------+----+-----+------+
|{"grade":"b","age":"1"}|a |7 |null |
|null |null|null |x |
+-----------------------+----+-----+------+
Please note that the info column is a JSON string here, not a struct. You may have to apply the same approach again to extract the fields of the nested JSON.
val testString = """[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]"""
val ds = Seq(testString).toDS()
spark.read.json(ds)
.select("info.age", "info.grade","name","other","random")
.show(10,false)
I have the following dataframe in spark:
root
|-- user_id: string (nullable = true)
|-- payload: string (nullable = true)
where payload is a JSON string with no fixed schema. Here is some sample data:
{'user_id': '001','payload': '{"country":"US","time":"11111"}'}
{'user_id': '002','payload': '{"message_id":"8936716"}'}
{'user_id': '003','payload': '{"brand":"adidas","when":""}'}
I want to output the above data in JSON format with the payload flattened (that is, extract the key-value pairs from payload and put them at the root level), for example:
{'user_id': '001','country':'US','time':'11111'}
{'user_id': '002','message_id':'8936716'}
{'user_id': '003','brand':'adidas','when':''}
Stack Overflow marked this as a duplicate of "Flatten Nested Spark Dataframe", but it is not.
The difference here is that the value of payload in my case is just string type.
You can parse the payload JSON as a map<string,string> and add the user_id to the payload:
import pyspark.sql.functions as F
# input dataframe
df.show(truncate=False)
+-------+-------------------------------+
|user_id|payload |
+-------+-------------------------------+
|001 |{"country":"US","time":"11111"}|
|002 |{"message_id":"8936716"} |
|003 |{"brand":"adidas","when":""} |
+-------+-------------------------------+
df2 = df.select(
F.to_json(
F.map_concat(
F.create_map(F.lit('user_id'), F.col('user_id')),
F.from_json('payload', 'map<string,string>')
)
).alias('out')
)
df2.show(truncate=False)
+-----------------------------------------------+
|out |
+-----------------------------------------------+
|{"user_id":"001","country":"US","time":"11111"}|
|{"user_id":"002","message_id":"8936716"} |
|{"user_id":"003","brand":"adidas","when":""} |
+-----------------------------------------------+
To write it to a JSON file, you can do:
df2.coalesce(1).write.text('filepath')
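To make the per-row effect of the map_concat approach concrete, here is a minimal plain-Python sketch without Spark. The function name flatten_payload is hypothetical, and the collision rule (a payload key named user_id would overwrite the original) is an assumption of this sketch, not necessarily how Spark treats duplicate map keys.

```python
import json

def flatten_payload(row):
    """Merge the keys of the JSON string in row['payload'] into the root level."""
    out = {"user_id": row["user_id"]}
    # Assumption of this sketch: payload keys overwrite root keys on collision
    out.update(json.loads(row["payload"]))
    return json.dumps(out, separators=(",", ":"))

rows = [
    {"user_id": "001", "payload": '{"country":"US","time":"11111"}'},
    {"user_id": "002", "payload": '{"message_id":"8936716"}'},
    {"user_id": "003", "payload": '{"brand":"adidas","when":""}'},
]
for row in rows:
    print(flatten_payload(row))
# {"user_id":"001","country":"US","time":"11111"}
# {"user_id":"002","message_id":"8936716"}
# {"user_id":"003","brand":"adidas","when":""}
```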
This is how I finally solved the problem:
from pyspark.sql.functions import col, from_json
# Infer the payload schema by reading the JSON strings themselves
json_schema = spark.read.json(source_parquet_df.rdd.map(lambda row: row.payload)).schema
new_df = source_parquet_df.withColumn('payload_json_obj', from_json(col('payload'), json_schema)).drop(source_parquet_df.payload)
flat_df = new_df.select([c for c in new_df.columns if c != 'payload_json_obj'] + ['payload_json_obj.*'])
I have the following dataframe
+-------+--------------------------------
|__key__|______value____________________|
| 1 | {"name":"John", "age": 34} |
| 2 | {"name":"Rose", "age": 50} |
I want to retrieve all age values from this DataFrame and store them in an array later.
val x = df_clean.withColumn("value", col("value.age"))
x.show(false)
But this throws an exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Can't extract value from value#89: need struct type but got string;
How can I resolve this?
EDIT
val schema = existingSparkSession.read.json(df_clean.select("value").as[String]).schema
val my_json = df_clean.select(from_json(col("value"), schema).alias("jsonValue"))
my_json.printSchema()
val df_final = my_json.withColumn("age", col("jsonValue.age"))
df_final.show(false)
No exceptions are thrown now, but I don't see any output either.
EDIT 2
println("---+++++--------")
df_clean.select("value").take(1)
println("---+++++--------")
output
---+++++--------
---+++++--------
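If the JSON is long and you don't want to write its schema by hand, you can infer the schema from the data and pass it to from_json.
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF, as[String] and the $"..." syntax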
val df = Seq(
(1, "{\"name\":\"John\", \"age\": 34}"),
(2, "{\"name\":\"Rose\", \"age\": 50}")
).toDF("key", "value")
val schema = spark.read.json(df.select("value").as[String]).schema
val resultDF = df.withColumn("value", from_json($"value", schema))
resultDF.show(false)
resultDF.printSchema()
Output:
+---+----------+
|key|value |
+---+----------+
|1 |{34, John}|
|2 |{50, Rose}|
+---+----------+
Schema:
root
|-- key: integer (nullable = false)
|-- value: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
If you directly need to access the nested fields then you can use get_json_object
df.withColumn("name", get_json_object($"value", "$.name"))
.withColumn("age", get_json_object($"value", "$.age"))
.show(false)
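The original error arises because value is a plain string, so a field such as age is only reachable after the string has been parsed. A minimal plain-Python sketch of that idea, collecting the ages into a list as the question asks (no Spark involved; the variable names are illustrative):

```python
import json

values = ['{"name":"John", "age": 34}', '{"name":"Rose", "age": 50}']

# Parse each string first, then read the field; .get returns None if "age" is missing
ages = [json.loads(v).get("age") for v in values]
print(ages)  # [34, 50]
```

In Spark the equivalent would be to select the parsed field and collect it on the driver, which is only advisable when the result is small.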
I have a JSON like
{ 1234 : "blah1", 9807: "blah2", 467: "blah_k", ...}
written to a gzipped file. It is a mapping of one ID space to another where the keys are ints and values are strings.
I want to load it as a DataFrame in Spark.
I loaded it as,
val df = spark.read.format("json").load("my_id_file.json.gz")
By default, Spark loaded it with a schema that looks like
|-- 1234: string (nullable = true)
|-- 9807: string (nullable = true)
|-- 467: string (nullable = true)
Instead, I want my DataFrame to look like
+----+------+
|id1 |id2 |
+----+------+
|1234|blah1 |
|9807|blah2 |
|467 |blah_k|
+----+------+
So, I tried the following.
import org.apache.spark.sql.types._
val idMapSchema = StructType(Array(StructField("id1", IntegerType, true), StructField("id2", StringType, true)))
val df = spark.read.format("json").schema(idMapSchema).load("my_id_file.json.gz")
However, the loaded data frame looks like
scala> df.show
+----+----+
|id1 |id2 |
+----+----+
|null|null|
+----+----+
How can I specify the schema to fix this? Is there a "pure" dataframe approach (without creating an RDD and then creating DataFrame)?
One way to achieve this is to read the input file with textFile, apply your parsing logic within flatMap(), and then convert the result to a DataFrame:
import scala.collection.JavaConverters._
import org.json.JSONObject
val rdd = sparkSession.sparkContext.textFile("my_input_file_path")
  .flatMap { row =>
    val inputJson = new JSONObject(row)
    // Emit one {"col1": key, "col2": value} JSON string per key of the input object
    inputJson.keySet().asScala.toSeq.map { key =>
      new JSONObject().put("col1", key).put("col2", inputJson.get(key)).toString
    }
  }
val df = sparkSession.read.json(rdd)
df.printSchema()
df.show(false)
output:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
+----+------+
|col1|col2 |
+----+------+
|1234|blah1 |
|467 |blah_k|
|9807|blah2 |
+----+------+
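The reshaping done inside the map above, turning one wide JSON object into one (key, value) row per entry, can be sketched in plain Python without Spark. The function name to_id_rows is hypothetical, and the sketch assumes the keys are quoted, as strict JSON requires; the unquoted keys shown in the question would need a more lenient parser.

```python
import json

def to_id_rows(text):
    """Turn a JSON object mapping ids to strings into a list of (id1, id2) rows."""
    mapping = json.loads(text)
    # JSON object keys are always strings, so convert them back to int explicitly
    return [(int(key), value) for key, value in mapping.items()]

raw = '{"1234": "blah1", "9807": "blah2", "467": "blah_k"}'
for id1, id2 in to_id_rows(raw):
    print(id1, id2)
# 1234 blah1
# 9807 blah2
# 467 blah_k
```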
I have a Dataframe like this:
+--+--------+--------+----+-------------+------------------------------+
|id|name |lastname|age |timestamp |creditcards |
+--+--------+--------+----+-------------+------------------------------+
|1 |michel |blanc |35 |1496756626921|[[hr6,3569823], [ee3,1547869]]|
|2 |peter |barns |25 |1496756626551|[[ye8,4569872], [qe5,3485762]]|
+--+--------+--------+----+-------------+------------------------------+
where the schema of my df is like below:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- lastname: string (nullable = true)
|-- age: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- creditcards: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- number: string (nullable = true)
I would like to convert each row to a JSON string, given my schema, so that the resulting DataFrame has a single string column containing the JSON.
The first row should look like this:
{
"id":"1",
"name":"michel",
"lastname":"blanc",
"age":"35",
"timestamp":"1496756626921",
"creditcards":[
{
"id":"hr6",
"number":"3569823"
},
{
"id":"ee3",
"number":"1547869"
}
]
}
and the second row of the DataFrame should look like this:
{
"id":"2",
"name":"peter",
"lastname":"barns",
"age":"25",
"timestamp":"1496756626551",
"creditcards":[
{
"id":"ye8",
"number":"4569872"
},
{
"id":"qe5",
"number":"3485762"
}
]
}
My goal is not to write the DataFrame to a JSON file. My goal is to convert df1 to a second DataFrame, df2, so that I can push each JSON line of df2 to a Kafka topic.
I have this code to create the dataframe:
val line1 = """{"id":"1","name":"michel","lastname":"blanc","age":"35","timestamp":"1496756626921","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}]}"""
val line2 = """{"id":"2","name":"peter","lastname":"barns","age":"25","timestamp":"1496756626551","creditcards":[{"id":"ye8","number":"4569872"}, {"id":"qe5","number":"3485762"}]}"""
val rdd = sc.parallelize(Seq(line1, line2))
val df = sqlContext.read.json(rdd)
df show false
df printSchema
Do you have any idea?
If all you need is a single-column DataFrame/Dataset with each column value representing each row of the original DataFrame in JSON, you can simply apply toJSON to your DataFrame, as in the following:
df.show
// +---+------------------------------+---+--------+------+-------------+
// |age|creditcards |id |lastname|name |timestamp |
// +---+------------------------------+---+--------+------+-------------+
// |35 |[[hr6,3569823], [ee3,1547869]]|1 |blanc |michel|1496756626921|
// |25 |[[ye8,4569872], [qe5,3485762]]|2 |barns |peter |1496756626551|
// +---+------------------------------+---+--------+------+-------------+
val dsJson = df.toJSON
// dsJson: org.apache.spark.sql.Dataset[String] = [value: string]
dsJson.show
// +--------------------------------------------------------------------------+
// |value |
// +--------------------------------------------------------------------------+
// |{"age":"35","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3",...|
// |{"age":"25","creditcards":[{"id":"ye8","number":"4569872"},{"id":"qe5",...|
// +--------------------------------------------------------------------------+
[UPDATE]
To add name as an additional column, you can extract it from the JSON column using from_json:
val result = dsJson.withColumn("name", from_json($"value", df.schema)("name"))
result.show
// +--------------------+------+
// | value| name|
// +--------------------+------+
// |{"age":"35","cred...|michel|
// |{"age":"25","cred...| peter|
// +--------------------+------+
For that, you can directly convert your dataframe to a Dataset of JSON string using
val jsonDataset: Dataset[String] = df.toJSON
You can convert it into a dataframe using
val jsonDF: DataFrame = jsonDataset.toDF
Here the JSON keys follow the DataFrame's column order, which is alphabetical because the schema was inferred by read.json, so the output of
jsonDF show false
will be
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"age":"35","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}],"id":"1","lastname":"blanc","name":"michel","timestamp":"1496756626921"}|
|{"age":"25","creditcards":[{"id":"ye8","number":"4569872"},{"id":"qe5","number":"3485762"}],"id":"2","lastname":"barns","name":"peter","timestamp":"1496756626551"} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
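For intuition about what each value holds, here is a plain-Python sketch that serializes one row, represented as a dict, into the same compact JSON string. It does not use Spark; the function name row_to_json is hypothetical, and sort_keys=True is used only to mimic the alphabetical column order seen above.

```python
import json

def row_to_json(row):
    """Serialize one row (a dict, with nested lists of dicts) to a compact JSON string."""
    # sort_keys mimics the alphabetical column order of the inferred schema
    return json.dumps(row, sort_keys=True, separators=(",", ":"))

row = {
    "id": "1",
    "name": "michel",
    "lastname": "blanc",
    "age": "35",
    "timestamp": "1496756626921",
    "creditcards": [
        {"id": "hr6", "number": "3569823"},
        {"id": "ee3", "number": "1547869"},
    ],
}
message = row_to_json(row)
print(message)
```

Each such string is what you would send as the value of one Kafka message.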