Parsing a JSON file and extracting keys and values using Spark

I'm new to Spark. I have tried to parse the JSON file below using Spark SQL, but it didn't work. Can someone please help me resolve this?
Input JSON:
[{"num":"1234","Projections":[{"Transactions":[{"14:45":0,"15:00":0}]}]}]
Expected output:
1234 14:45 0
1234 15:00 0
I have tried the code below, but it did not work:
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("hdfs:/user/aswin/test.json").toDF();
val sql_output = sqlContext.sql("SELECT num, Projections.Transactions FROM df group by Projections.TotalTransactions ")
sql_output.collect.foreach(println)
Output:
[01532,WrappedArray(WrappedArray([0,0]))]

Spark recognizes your {"14:45":0,"15:00":0} map as a struct, so probably the only way to read your data is to specify the schema manually:
from pyspark.sql.types import *

schema = StructType([
    StructField('num', StringType()),
    StructField('Projections', ArrayType(StructType([
        StructField('Transactions', ArrayType(MapType(StringType(), IntegerType())))
    ])))
])
Then register the data as a temporary table and query it, using multiple explodes to unnest the structure:
sqlContext.read.json('sample.json', schema=schema).registerTempTable('df')
sqlContext.sql("select num, explode(col) from (select explode(col.Transactions), num from (select explode(Projections), num from df))").show()
+----+-----+-----+
| num|  key|value|
+----+-----+-----+
|1234|14:45|    0|
|1234|15:00|    0|
+----+-----+-----+
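If you prefer to stay in Scala, as in your question, here is a rough sketch of the same approach (untested here; it assumes the same file path and the SQLContext from your snippet):
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val schema = StructType(Seq(
  StructField("num", StringType),
  StructField("Projections", ArrayType(StructType(Seq(
    StructField("Transactions", ArrayType(MapType(StringType, IntegerType)))
  ))))
))

// Spark allows only one generator (explode) per select, so unnest step by step
sqlContext.read.schema(schema).json("hdfs:/user/aswin/test.json")
  .select(col("num"), explode(col("Projections")).as("p"))     // array of structs -> one row per struct
  .select(col("num"), explode(col("p.Transactions")).as("t"))  // array of maps -> one row per map
  .select(col("num"), explode(col("t")))                       // map -> key/value columns
  .show()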

Related

getting null values when parsing json column in .csv file using from_json (using spark with scala ver 2.4)

Hi all, I am getting null values when using from_json. Can you help me figure out the missing piece here?
Input is a .csv file with a JSON column, e.g.
id,request
1,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}
2,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}
My code (Scala/Spark):
val input_df = spark.read.option("header", true).option("escape", "\"").csv(json_file_input)

val json_schema_abc = StructType(Array(
  StructField("Zipcode", IntegerType, true),
  StructField("ZipCodeType", StringType, true),
  StructField("City", StringType, true),
  StructField("State", StringType, true)
))

val output_df = input_df.select($"id", from_json(col("request"), json_schema_abc).as("json_request"))
  .select("id", "json_request.*")
Your problem is that the commas inside your JSON column are being treated as CSV delimiters. If you have a look at the contents of your input_df:
val input_df = spark.read.option("header",true).option("escape","\"").csv(json_file_input)
input_df.show(false)
+---+--------------+
|id |request       |
+---+--------------+
|1  |{"Zipcode":704|
|2  |{"Zipcode":704|
+---+--------------+
You can see that the request column is not complete: it was chopped off at the first comma.
The rest of your code is correct, you can test it like this:
val input_df = Seq(
  (1, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""),
  (2, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}""")
).toDF("id", "request")

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val json_schema_abc = StructType(Array(
  StructField("Zipcode", IntegerType, true),
  StructField("ZipCodeType", StringType, true),
  StructField("City", StringType, true),
  StructField("State", StringType, true)
))

val output_df = input_df
  .select($"id", from_json(col("request"), json_schema_abc).as("json_request"))
  .select("id", "json_request.*")

output_df.show(false)
+---+-------+-----------+-------------------+-----+
|id |Zipcode|ZipCodeType|City               |State|
+---+-------+-----------+-------------------+-----+
|1  |704    |STANDARD   |PARC PARQUE        |PR   |
|2  |704    |STANDARD   |PASEO COSTA DEL SUR|PR   |
+---+-------+-----------+-------------------+-----+
I would suggest changing your CSV file's delimiter (for example ;, if that character does not appear in your data); that way the commas inside the JSON won't interfere with parsing.
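For illustration, a minimal sketch of that variant, assuming the file has been rewritten with ; as the delimiter (json_file_input and json_schema_abc are the ones from your code):
// hypothetical ';'-delimited input:
// id;request
// 1;{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}
val input_df = spark.read
  .option("header", true)
  .option("delimiter", ";")   // commas inside the JSON are now just data
  .option("escape", "\"")
  .csv(json_file_input)

val output_df = input_df
  .select($"id", from_json(col("request"), json_schema_abc).as("json_request"))
  .select("id", "json_request.*")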

combining Json and normal columns with Pyspark

I have a flat file that mixes normal columns with JSON columns:
2020-08-05 00:00:04,489|{"Colour":"Blue", "Reason":"Sky","number":"1"}
2020-10-05 00:00:04,489|{"Colour":"Yellow", "Reason":"Flower","number":"2"}
I want to flatten it out like this using pyspark:
|Timestamp|Colour|Reason|
|--------|--------|--------|
|2020-08-05 00:00:04,489|Blue| Sky|
|2020-10-05 00:00:04,489|Yellow| Flower|
At the moment I can only figure out how to convert the JSON using spark.read.json and Map, but how do you combine regular columns like the timestamp?
Let's reconstruct your data:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

data2 = [("2020-08-05 00:00:04,489", '{"Colour":"Blue", "Reason":"Sky","number":"1"}'),
         ("2020-10-05 00:00:04,489", '{"Colour":"Yellow", "Reason":"Flower","number":"2"}')]

schema = StructType([
    StructField("x", StringType(), True),
    StructField("y", StringType(), True)])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)
As per the documentation, we can use schema_of_json to parse a JSON string and infer its schema in DDL format:
schema = df.select(F.schema_of_json(df.select("y").first()[0])).first()[0]
df.withColumn("y", F.from_json("y", schema)).selectExpr('x', "y.*").show(truncate=False)
+-----------------------+------+------+------+
|x                      |Colour|Reason|number|
+-----------------------+------+------+------+
|2020-08-05 00:00:04,489|Blue  |Sky   |1     |
|2020-10-05 00:00:04,489|Yellow|Flower|2     |
+-----------------------+------+------+------+
You can use get_json_object. Assuming that the original columns are called col1 and col2, then you can do:
df2 = df.select(
    F.col('col1').alias('Timestamp'),
    F.get_json_object('col2', '$.Colour').alias('Colour'),
    F.get_json_object('col2', '$.Reason').alias('Reason')
)
df2.show(truncate=False)
+-----------------------+------+------+
|Timestamp              |Colour|Reason|
+-----------------------+------+------+
|2020-08-05 00:00:04,489|Blue  |Sky   |
|2020-10-05 00:00:04,489|Yellow|Flower|
+-----------------------+------+------+
Or you can use from_json:
import pyspark.sql.functions as F
df2 = df.select(
    F.col('col1').alias('Timestamp'),
    F.from_json('col2', 'Colour string, Reason string').alias('col2')
).select('Timestamp', 'col2.*')
df2.show(truncate=False)
+-----------------------+------+------+
|Timestamp              |Colour|Reason|
+-----------------------+------+------+
|2020-08-05 00:00:04,489|Blue  |Sky   |
|2020-10-05 00:00:04,489|Yellow|Flower|
+-----------------------+------+------+
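Both answers start from a manually reconstructed DataFrame. For completeness, here is a hedged sketch of reading the original pipe-delimited flat file itself (shown in Scala; the same reader options exist in PySpark), assuming no header row and a hypothetical file path:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val jsonSchema = StructType(Seq(
  StructField("Colour", StringType),
  StructField("Reason", StringType),
  StructField("number", StringType)
))

val raw = spark.read
  .option("delimiter", "|")       // '|' separates the fields, so the comma in the timestamp is untouched
  .csv("/path/to/flatfile.txt")   // hypothetical path
  .toDF("Timestamp", "json")

val flattened = raw
  .select(col("Timestamp"), from_json(col("json"), jsonSchema).as("j"))
  .select("Timestamp", "j.*")

flattened.show(false)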

How to split an array-of-JSON DataFrame column into multiple rows in Scala

How can I split an array-of-JSON column in a DataFrame into multiple rows in Spark-Scala?
Input DataFrame:
+----------+-------------+-----------------------------------------------------------------------------------------------------------------------------+
|item_id |s_tag |jsonString |
+----------+-------------+-----------------------------------------------------------------------------------------------------------------------------+
|Item_12345|S_12345|[{"First":{"Info":"ABCD123","Res":"5.2"}},{"Second":{"Info":"ABCD123","Res":"5.2"}},{"Third":{"Info":"ABCD123","Res":"5.2"}}] |
+----------+-------------+-----------------------------------------------------------------------------------------------------------------------------+
Output DataFrame:
+----------+-------+-----------------------------------------+
|item_id   |s_tag  |jsonString                               |
+----------+-------+-----------------------------------------+
|Item_12345|S_12345|{"First":{"Info":"ABCD123","Res":"5.2"}} |
|Item_12345|S_12345|{"Second":{"Info":"ABCD123","Res":"5.2"}}|
|Item_12345|S_12345|{"Third":{"Info":"ABCD123","Res":"5.2"}} |
+----------+-------+-----------------------------------------+
This is what I have tried so far, but it did not work:
val rawDF = sparkSession
  .sql("select 1")
  .withColumn("item_id", lit("Item_12345"))
  .withColumn("s_tag", lit("S_12345"))
  .withColumn("jsonString", lit("""[{"First":{"Info":"ABCD123","Res":"5.2"}},{"Second":{"Info":"ABCD123","Res":"5.2"}},{"Third":{"Info":"ABCD123","Res":"5.2"}}]"""))

val newDF = rawDF.withColumn("splittedJson", explode(rawDF.col("jsonString")))
The issue in the example code you posted is that the json is represented as a string and hence cannot be exploded. Try something like this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{typedLit, _}

object tmp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()

    val arr = Seq("{\"First\":{\"Info\":\"ABCD123\",\"Res\":\"5.2\"}}",
                  "{\"Second\":{\"Info\":\"ABCD123\",\"Res\":\"5.2\"}}",
                  "{\"Third\":{\"Info\":\"ABCD123\",\"Res\":\"5.2\"}}")

    val rawDF = spark.sql("select 1")
      .withColumn("item_id", lit("Item_12345"))
      .withColumn("s_tag", lit("S_12345"))
      .withColumn("jsonString", typedLit(arr))

    val newDF = rawDF.withColumn("splittedJson", explode(rawDF.col("jsonString")))
    newDF.show()
  }
}
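If the column really does arrive as a single JSON-array string, as in the question, another option is to parse it with from_json before exploding. This is a hedged sketch against the rawDF from the question, assuming a Spark version where from_json accepts an ArrayType schema (2.2+); note the exploded elements come back as maps keyed by "First"/"Second"/"Third" rather than raw JSON strings:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// one map per array element, each holding a struct with Info and Res
val elementType = ArrayType(MapType(StringType, StructType(Seq(
  StructField("Info", StringType),
  StructField("Res", StringType)
))))

val splitDF = rawDF
  .withColumn("parsed", from_json(col("jsonString"), elementType))
  .withColumn("splittedJson", explode(col("parsed")))
  .drop("parsed", "jsonString")

splitDF.show(false)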

Timestamp format getting converted when DataFrame.toJSON is done in Spark Scala

I have a timestamp column in a Hive table that is read into a DataFrame using Spark SQL.
Once I have the DataFrame, I convert it to a JSON string using the toJSON function in Spark.
However, the timestamp format is changed after applying toJSON to the DataFrame.
Code and output as follows.
scala> newDF.show(false)
+--------------------------+--------------------------+
|current_ts                |new_ ts                   |
+--------------------------+--------------------------+
|2019-04-10 01:00:27.551022|2019-04-10 06:00:27.551022|
|2019-04-10 01:00:49.07757 |2019-04-10 06:00:49.07757 |
+--------------------------+--------------------------+
scala> newDF.toJSON.show(false)
+-------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------+
|{" current_ts ":"2019-04-10T01:00:27.551-05:00","new_ ts":"2019-04-10T06:00:27.551-05:00"}|
|{" current_ts ":"2019-04-10T01:00:49.077-05:00","new_ ts":"2019-04-10T06:00:49.077-05:00"}|
The above output is not acceptable; we need to keep the timestamp as it is displayed in the DataFrame, without casting it to the String data type.
The output I need is as follows:
+-------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------+
|{" current_ts ":"2019-04-10T01:00:27.551022","new_ ts":"2019-04-10T06:00:27.551022"}|
|{" current_ts ":"2019-04-10T01:00:49.07757","new_ ts":"2019-04-10T06:00:49.07757"}|
I am getting the expected output. Please see below:
scala> val df = Seq(("2019-04-10 01:00:27.551022", "2019-04-10 06:00:27.551022"),
                    ("2019-04-10 01:00:49.07757", "2019-04-10 06:00:49.07757")).toDF("current_ts","new_ts")
Output
scala> df.toJSON.show(false)
+---------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------+
|{"current_ts":"2019-04-10 01:00:27.551022","new_ts":"2019-04-10 06:00:27.551022"}|
|{"current_ts":"2019-04-10 01:00:49.07757","new_ts":"2019-04-10 06:00:49.07757"} |
+---------------------------------------------------------------------------------+
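One note on why the two outputs differ: in this test current_ts and new_ts are plain string columns, so toJSON prints them verbatim. When the columns are real timestamps (as they are when read from a Hive table), toJSON serializes them with its default ISO-8601 timestamp format, which is what the question observes. A small sketch to reproduce that, assuming Spark 2.x defaults:
import org.apache.spark.sql.functions.col

// cast the string columns to real timestamps, then serialize again
val tsDF = df.select(col("current_ts").cast("timestamp"), col("new_ts").cast("timestamp"))
tsDF.toJSON.show(false)
// values now come back like "2019-04-10T01:00:27.551-05:00" (millisecond precision, session time zone)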
I am using Spark 2.4. Can you please specify the version you are using as well?
Thanks.

Create Spark Dataframe from SQL Query

I'm sure this is a simple SQLContext question, but I can't find any answer in the Spark docs or on Stack Overflow.
I want to create a Spark DataFrame from a SQL query on MySQL.
For example, I have a complicated MySQL query like
SELECT a.X,b.Y,c.Z FROM FOO as a JOIN BAR as b ON ... JOIN ZOT as c ON ... WHERE ...
and I want a Dataframe with Columns X,Y and Z
I figured out how to load entire tables into Spark, and I could load them all, and then do the joining and selection there. However, that is very inefficient. I just want to load the table generated by my SQL query.
Here is my current approximation of the code, which doesn't work. The MySQL connector has a "dbtable" option that can be used to load a whole table. I am hoping there is some way to specify a query instead:
val df = sqlContext.format("jdbc").
option("url", "jdbc:mysql://localhost:3306/local_content").
option("driver", "com.mysql.jdbc.Driver").
option("useUnicode", "true").
option("continueBatchOnError","true").
option("useSSL", "false").
option("user", "root").
option("password", "").
sql(
"""
select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
join DialogLine as dl on dl.DialogID=d.DialogID
join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
join WordRoot as wr on wr.WordRootID=wi.WordRootID
where d.InSite=1 and dl.Active=1
limit 100
"""
).load()
I found this here Bulk data migration through Spark SQL
The dbtable parameter can be any query wrapped in parentheses with an alias. So in my case, I need to do this:
val query = """
(select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
join DialogLine as dl on dl.DialogID=d.DialogID
join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
join WordRoot as wr on wr.WordRootID=wi.WordRootID
where d.InSite=1 and dl.Active=1
limit 100) foo
"""
val df = sqlContext.read.format("jdbc").
option("url", "jdbc:mysql://localhost:3306/local_content").
option("driver", "com.mysql.jdbc.Driver").
option("useUnicode", "true").
option("continueBatchOnError","true").
option("useSSL", "false").
option("user", "root").
option("password", "").
option("dbtable",query).
load()
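As a side note, on newer Spark versions (2.4+) the JDBC source also accepts a query option directly, so the subquery-with-alias wrapping is not needed. A hedged sketch (plainQuery is just the raw SELECT without the surrounding parentheses and alias):
val plainQuery = """select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
  join DialogLine as dl on dl.DialogID=d.DialogID
  join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
  join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
  join WordRoot as wr on wr.WordRootID=wi.WordRootID
  where d.InSite=1 and dl.Active=1"""

val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/local_content")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "")
  .option("query", plainQuery)   // mutually exclusive with the dbtable option
  .load()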
As expected, loading each table as its own Dataframe and joining them in Spark was very inefficient.
If you have your table already registered in your SQLContext, you could simply use the sql method.
val resultDF = sqlContext.sql("SELECT a.X,b.Y,c.Z FROM FOO as a JOIN BAR as b ON ... JOIN ZOT as c ON ... WHERE ...")
To save the output of a query to a new DataFrame, simply set the result equal to a variable:
val newDataFrame = spark.sql("SELECT a.X,b.Y,c.Z FROM FOO as a JOIN BAR as b ON ... JOIN ZOT as c ON ... WHERE ...")
Now newDataFrame is a DataFrame with all the usual DataFrame functionality available to it.
TL;DR: just create a view in your database.
Detail:
I have a table t_city in my postgres database, on which I create a view:
create view v_city_3500 as
select asciiname, country, population, elevation
from t_city
where elevation > 3500
and population > 100000;

select * from v_city_3500;
 asciiname | country | population | elevation
-----------+---------+------------+-----------
 Potosi    | BO      |     141251 |      3967
 Oruro     | BO      |     208684 |      3936
 La Paz    | BO      |     812799 |      3782
 Lhasa     | CN      |     118721 |      3651
 Puno      | PE      |     116552 |      3825
 Juliaca   | PE      |     245675 |      3834
In the spark-shell:
val sx = new org.apache.spark.sql.SQLContext(sc)

val props = new java.util.Properties()
props.setProperty("driver", "org.postgresql.Driver")

val url = "jdbc:postgresql://buya/dmn?user=dmn&password=dmn"

val city_df      = sx.read.jdbc(url, "t_city", props)
val city_3500_df = sx.read.jdbc(url, "v_city_3500", props)
Result:
city_df.count()
Long = 145725
city_3500_df.count()
Long = 6
With MySQL, read/load data with something like the below:
val conf = new SparkConf().setAppName("SparkMe Application").setMaster("local[2]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://<host>:3306/corbonJDBC?user=user&password=password",
      "dbtable" -> "TABLE_NAME")).load()
Write data to a table as below:
import java.util.Properties
val prop = new Properties()
prop.put("user", "<>")
prop.put("password", "simple$123")
val dfWriter = jdbcDF.write.mode("append")
dfWriter.jdbc("jdbc:mysql://<host>:3306/corbonJDBC?user=user&password=password", "tableName", prop)
To create a DataFrame from a query, do something like the below:
val finalModelDataDF = {
  val query = "select * from table_name"
  sqlContext.sql(query)
}
finalModelDataDF.show()