How to infer the schema of a serialized JSON column in Spark SQL?

I have a table with one column that contains serialized JSON. I want to apply schema inference on this JSON column. I don't know the schema to pass as input for JSON extraction (e.g., the from_json function).
I can do this in Scala like this:
val contextSchema = spark.read.json(data.select("context").as[String]).schema
val updatedData = data.withColumn("context", from_json(col("context"), contextSchema))
How can I translate this solution into pure Spark SQL?

For Spark SQL, use toDDL to generate the schema string, then use that schema string in from_json.
Example:
df.show(10,false)
//+---+-------------------+
//|seq|json               |
//+---+-------------------+
//|1  |{"id":1,"name":"a"}|
//+---+-------------------+
val sch=spark.read.json(df.select("json").as[String]).schema.toDDL
//sch: String = `id` BIGINT,`name` STRING
df.createOrReplaceTempView("tmp")
spark.sql(s"""select seq,jsn.* from (select *,from_json(json,"$sch") as jsn from tmp)""").
show(10,false)
//+---+---+----+
//|seq|id |name|
//+---+---+----+
//|1  |1  |a   |
//+---+---+----+

You can use the schema_of_json() function to infer the JSON schema:
select from_json(<column_name>, schema_of_json(<sample_JSON>)) from <table>
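For example, building on the tmp view from the first answer, a minimal sketch could look like this (the sample JSON literal is taken from the example data above; schema_of_json is available from Spark 2.4):
spark.sql("""
  SELECT seq, jsn.*
  FROM (
    SELECT *, from_json(json, schema_of_json('{"id":1,"name":"a"}')) AS jsn
    FROM tmp
  )
""").show(false)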

Extract multiple columns from a JSON string

I have JSON data that I want to represent in tabular form and later write to a different format (Parquet).
Schema
root
|-- : string (nullable = true)
sample data
+----------------------------------------------+
|                                              |
+----------------------------------------------+
|{"deviceTypeId":"A2A","deviceId":"123","geo...|
|{"deviceTypeId":"A2B","deviceId":"456","geo...|
+----------------------------------------------+
Expected Output
+------------+--------+---+
|deviceTypeId|deviceId|...|
+------------+--------+---+
|         A2A|     123|   |
|         A2B|     456|   |
+------------+--------+---+
I tried splitting the string, but this doesn't seem like an efficient approach
split_col = split(df_explode[''], ',')
And then extract the columns, but it appends the initial string as well.
df_1 = df_explode.withColumn('deviceId',split_col.getItem(1))
# df_1 = df_explode.withColumn('deviceTypeId',split_col.getItem(0))
printOutput(df_1)
I'm looking for a better way to solve this problem
The explode function only works on arrays.
In your case, which is JSON, you should use the from_json function.
Please refer to from_json in pyspark.sql.functions.
I was able to do it using the from_json function.
from pyspark.sql.functions import col, from_json

# Convert the JSON column to multiple columns
schema = getSchema()
dfJSON = df_explode.withColumn("jsonData", from_json(col(''), schema)) \
    .select("jsonData.*")
dfJSON.printSchema()
dfJSON.limit(100).toPandas()
We need to create the JSON schema that will be used to parse the JSON data:
from pyspark.sql.types import StructType, StructField, StringType

def getSchema():
    schema = StructType([
        StructField('deviceTypeId', StringType()),
        StructField('deviceId', StringType()),
        ...
    ])
    return schema
The column name is empty in this JSON data, which is why col('') is called with an empty string.

Better/Efficient Ways to Parsing Nested JSON Column from Spark Table

I am new to Spark and Scala. I am trying to parse a nested JSON column from a Spark table. Here is a sneak peek of the table (I only show the first row; the rest all look identical):
doc.show(1)
+-----------------------------+---------+--------------+
|doc_content                  |object_id|object_version|
+-----------------------------+---------+--------------+
|{"id":"lni001","pub_date"....|20220301 |7098727       |
+-----------------------------+---------+--------------+
The structure of the "doc_content" column of each row looks like this (some rows may have more information stored inside the 'content' field):
{
  "id":"lni001",
  "pub_date":"20220301",
  "doc_id":"7098727",
  "unique_id":"64WP-UI-POLI",
  "content":[
    {
      "c_id":"002",
      "p_id":"P02",
      "type":"org",
      "source":"internet"
    },
    {
      "c_id":"003",
      "p_id":"P03",
      "type":"org",
      "source":"internet"
    },
    {
      "c_id":"005",
      "p_id":"K01",
      "type":"people",
      "source":"news"
    }
  ]
}
I tried to use explode on the "doc_content" column
doc.select(explode($"doc_content") as "doc_content")
.withColumn("id", col("doc_info.id"))
.withColumn("pub_date", col("doc_info.pub_date"))
.withColumn("doc_id", col("doc_info.doc_id"))
.withColumn("unique_id", col("doc_info.unique_id"))
.withColumn("content", col("doc_info.content"))
.withColumn("content", explode($"content"))
.withColumn("c_id", col("content.c_id"))
.withColumn("p_id", col("content.p_id"))
.withColumn("type", col("content.type"))
.withColumn("source", col("content.source"))
.drop(col("doc_content"))
.drop(col("content"))
.show()
but I got this error: org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`doc_content`)' due to data type mismatch: input to function explode should be array or map type, not string;. I am struggling with converting the column into an array or map type (probably because I am new to Scala, LOL).
After parsing the "doc_content" column, I want the table to look like this:
id     pub_date doc_id  unique_id    c_id p_id type   source   object_id object_version
lni001 20220301 7098727 64WP-UI-POLI 002  P02  org    internet 20220301  7098727
lni001 20220301 7098727 64WP-UI-POLI 003  P03  org    internet 20220301  7098727
lni001 20220301 7098727 64WP-UI-POLI 005  K01  people news     20220301  7098727
I am wondering how I can do this; it would be great to get some ideas or approaches, or maybe a better way than mine, since I have millions of rows in the Spark table and would like it to run faster.
Thanks!
You can use from_json to parse the JSON string into a structured column, then use explode on the array column to create new rows, which means you should explode doc_content.content rather than doc_content.
Specify the schema to use when parsing the JSON string:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // for the $"col" syntax
val schema = new StructType()
.add("id", StringType)
.add("pub_date", StringType)
.add("doc_id", StringType)
.add("unique_id", StringType)
.add("content", ArrayType(MapType(StringType, StringType)))
Then parse the JSON string and explode the array:
df.select(
$"object_id",
$"object_version",
from_json($"doc_content", schema).alias("doc_content")
).select(
$"object_id",
$"object_version",
col("doc_content.id").alias("id"),
col("doc_content.pub_date").alias("pub_date"),
col("doc_content.doc_id").alias("doc_id"),
col("doc_content.unique_id").alias("unique_id"),
explode(col("doc_content.content")).alias("content")
).select(
$"id",
$"pub_date",
$"doc_id",
$"unique_id",
col("content.c_id").alias("c_id"),
col("content.p_id").alias("p_id"),
col("content.type").alias("type"),
col("content.source").alias("source"),
$"object_id",
$"object_version"
)
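If you prefer not to hand-write the schema, you can also infer it from the column itself, in the same way as the first question above. A minimal sketch, assuming a SparkSession named spark and the doc DataFrame from the question:
import org.apache.spark.sql.functions._
import spark.implicits._
// Infer the schema of the serialized JSON column (this costs an extra pass over the data),
// then reuse it in from_json instead of a hand-written StructType.
val docSchema = spark.read.json(doc.select("doc_content").as[String]).schema
val parsed = doc.withColumn("doc_content", from_json(col("doc_content"), docSchema))
parsed.printSchema()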

Spark: How to merge json objects into array

I have the following JSON array in one of the columns of a DataFrame in Spark (2.4):
[{"quoteID":"12411736"},{"quoteID":"12438257"},{"quoteID":"12438288"},{"quoteID":"12438296"},{"quoteID":"12438299"}]
I am trying to merge it into:
{"quoteIDs":["12411736","12438257","12438288","12438296","12438299"]}
I need help with this.
Use from_json to read your column data as array<struct>, then explode it to flatten the data.
Finally, do groupBy + collect_list to create an array and use to_json to build the required JSON.
Example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // for toDF on a local Seq

val df=Seq(("a","""[{"quoteID":"123"},{"quoteID":"456"}]""")).toDF("i","j")
df.show(false)
//+---+-------------------------------------+
//|i  |j                                    |
//+---+-------------------------------------+
//|a  |[{"quoteID":"123"},{"quoteID":"456"}]|
//+---+-------------------------------------+
val sch=ArrayType(new StructType().add("quoteID",StringType))
val drop_cols=Seq("tmp1","tmp2","j")
//if you need as column in the dataframe
df.withColumn("tmp1",from_json(col("j"),sch)).
withColumn("tmp2",explode(col("tmp1"))).
selectExpr("*","tmp2.*").
groupBy("i").
agg(collect_list("quoteID").alias("quoteIDS")).
withColumn("quoteIDS",to_json(struct(col("quoteIDS")))).
drop(drop_cols:_*).
show(false)
//+---+--------------------------+
//|i  |quoteIDS                  |
//+---+--------------------------+
//|a  |{"quoteIDS":["123","456"]}|
//+---+--------------------------+
//if you need as json string
val df1=df.withColumn("tmp1",from_json(col("j"),sch)).
withColumn("tmp2",explode(col("tmp1"))).
selectExpr("*","tmp2.*").
groupBy("i").
agg(collect_list("quoteID").alias("quoteIDS"))
//toJSON (or) write as json file
df1.toJSON.first
//String = {"i":"a","quoteIDS":["123","456"]}
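As a side note, on Spark 2.4 you can also avoid the explode/groupBy (and the shuffle it implies) by using the higher-order transform function; a rough sketch using the same i/j columns as above:
import org.apache.spark.sql.functions._
// Parse the array of structs, pull out quoteID with transform(), and wrap the result
// back into a JSON object, so no explode or groupBy is needed.
val merged = df.withColumn(
  "quoteIDS",
  to_json(struct(
    expr("transform(from_json(j, 'array<struct<quoteID:string>>'), x -> x.quoteID)").alias("quoteIDs")
  ))
).drop("j")
merged.show(false)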

How to create a DataFrame from JSON data where two fields have the same key but differ in caps?

How can I create a DataFrame from JSON data where two fields have the same key but differ only in caps? For example:
{"abc1":"some-value", "ABC1":"some-other-value", "abc":"some-value"}
{"abc1":"some-value", "ABC1":"some-other-value", "abc":"some-value1"}
At present, I am getting the following error:
org.apache.spark.sql.AnalysisException: Reference 'ABC1' is ambiguous.
This is how I created the DataFrame:
val df = sqlContext.read.json(inputPath)
I also tried creating an RDD first, where I read every line and changed the key name in the JSON string, and then converted the RDD to a DataFrame. This approach was very slow.
I tried multiple ways, but the same problem remains.
I tried to rename the column names, but the error is still there:
val modDf = df
.withColumnRenamed("MCC", "MCC_CAP")
.withColumnRenamed("MNC", "MNC_CAP")
.withColumnRenamed("MCCMNC", "MCCMNC_CAP")
I also created another DataFrame with renamed columns:
val cols = df.columns.map(line => if (line.startsWith("M"))line.concat("_cap") else line)
val smallDf = df.toDF(cols: _*)
And tried dropping the duplicate columns:
val capCols = df.columns.filter(line => line.startsWith("M"))
val smallDf = df.drop(capCols: _*)
You should read it as a text RDD using sparkContext and then use sqlContext to read the RDD as JSON:
sqlContext.read.json(sc.textFile("path to your json file"))
You should get your dataframe as:
+----------------+-----------+----------+
|ABC1            |abc        |abc1      |
+----------------+-----------+----------+
|some-other-value|some-value |some-value|
|some-other-value|some-value1|some-value|
+----------------+-----------+----------+
The generated dataframe would have a defect though, as duplicate column names are not allowed in Spark dataframes (column resolution is case-insensitive by default).
So I would recommend changing the duplicate column names before you convert the JSON to a dataframe:
val rdd = sc.textFile("path to your json file").map(jsonLine => jsonLine.replace("\"ABC1\":", "\"AAA\":"))
sqlContext.read.json(rdd)
Now you should have the dataframe as:
+----------------+-----------+----------+
|AAA             |abc        |abc1      |
+----------------+-----------+----------+
|some-other-value|some-value |some-value|
|some-other-value|some-value1|some-value|
+----------------+-----------+----------+
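Alternatively, if you want to keep both columns, you can enable case-sensitive resolution so that the references are no longer ambiguous. A hedged sketch (downstream sinks such as Hive or Parquet may still object to names that differ only in case):
// With case-sensitive resolution, 'ABC1' and 'abc1' are treated as distinct columns.
sqlContext.setConf("spark.sql.caseSensitive", "true")
val dfCaseSensitive = sqlContext.read.json(inputPath)
dfCaseSensitive.select("ABC1", "abc1", "abc").show()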

Removing Unnecessary JSON fields using SPARK (SQL)

I'm a new Spark user currently playing around with Spark and some big data, and I have a question related to Spark SQL, or more formally the SchemaRDD. I'm reading a JSON file containing data about some weather forecasts, and I'm not really interested in all of the fields ... I only want 10 fields out of the 50+ fields returned for each record. Is there a way (similar to filter) that I can use to specify the names of the fields that I want to remove in Spark?
Just a small descriptive example: consider that I have the schema "Person" with 3 fields "Name", "Age", and "Gender", and I'm not interested in the "Age" field and would like to remove it. Can I use Spark somehow to do that? Thanks.
If you are using Spark 1.2, you can do the following (using Scala)...
If you already know what fields you want to use, you can construct the schema for these fields and apply this schema to the JSON dataset. Spark SQL will return a SchemaRDD. Then, you can register it and query it as a table. Here is a snippet...
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The schema is encoded in a string
val schemaString = "name gender"
// Import Spark SQL data types.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Create the SchemaRDD for your JSON file "people" (every line of this file is a JSON object).
val peopleSchemaRDD = sqlContext.jsonFile("people.txt", schema)
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Only values of name and gender fields will be in the results.
val results = sqlContext.sql("SELECT * FROM people")
When you look at the schema of peopleSchemaRDD (peopleSchemaRDD.printSchema()), you will only see the name and gender fields.
Or, if you want to explore the dataset and determine what fields you want after you see all fields, you can ask Spark SQL to infer the schema for you. Then, you can register the SchemaRDD as a table and use projection to remove unneeded fields. Here is a snippet...
// Spark SQL will infer the schema of the given JSON file.
val peopleSchemaRDD = sqlContext.jsonFile("people.txt")
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Project name and gender field.
sqlContext.sql("SELECT name, gender FROM people")
You can specify which fields you would like to have in the SchemaRDD. Below is an example: create a case class with only the fields that you need, read the data into an RDD, then select only the fields that you need (in the same order as you have specified them in the case class).
Sample Data: People.txt
foo,25,M
bar,24,F
Code:
case class Person(name: String, gender: String)
// In Spark 1.2, this implicit conversion lets an RDD of case classes be registered as a table
import sqlContext.createSchemaRDD
val people = sc.textFile("People.txt").map(_.split(",")).map(p => Person(p(0), p(2)))
people.registerTempTable("people")
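For completeness, a query against the registered table could then look like this (assuming the sqlContext from the first snippet is in scope):
// Only the fields kept in the case class are available; collect() brings the rows to the driver.
sqlContext.sql("SELECT name, gender FROM people").collect().foreach(println)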