Combining JSON and normal columns with PySpark

I have a flat file that mixes normal columns with JSON columns:
2020-08-05 00:00:04,489|{"Colour":"Blue", "Reason":"Sky","number":"1"}
2020-10-05 00:00:04,489|{"Colour":"Yellow", "Reason":"Flower","number":"2"}
I want to flatten it out like this using PySpark:
|Timestamp|Colour|Reason|
|--------|--------|--------|
|2020-08-05 00:00:04,489|Blue| Sky|
|2020-10-05 00:00:04,489|Yellow| Flower|
At the moment I can only figure out how to convert the JSON by using spark.read.json and Map, but how do you combine regular columns like the timestamp?

Let's reconstruct your data:
from pyspark.sql.types import StructType, StructField, StringType

data2 = [("2020-08-05 00:00:04,489", '{"Colour":"Blue", "Reason":"Sky","number":"1"}'),
         ("2020-10-05 00:00:04,489", '{"Colour":"Yellow", "Reason":"Flower","number":"2"}')]
schema = StructType([
    StructField("x", StringType(), True),
    StructField("y", StringType(), True)])
df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)
As per the documentation, we can use schema_of_json to parse a JSON string and infer its schema in DDL format:
import pyspark.sql.functions as F

schema = df.select(F.schema_of_json(df.select("y").first()[0])).first()[0]
df.withColumn("y", F.from_json("y", schema)).selectExpr("x", "y.*").show(truncate=False)
+-----------------------+------+------+------+
|x |Colour|Reason|number|
+-----------------------+------+------+------+
|2020-08-05 00:00:04,489|Blue |Sky |1 |
|2020-10-05 00:00:04,489|Yellow|Flower|2 |
+-----------------------+------+------+------+

You can use get_json_object. Assuming that the original columns are called col1 and col2, you can do:
import pyspark.sql.functions as F

df2 = df.select(
    F.col('col1').alias('Timestamp'),
    F.get_json_object('col2', '$.Colour').alias('Colour'),
    F.get_json_object('col2', '$.Reason').alias('Reason')
)
df2.show(truncate=False)
+-----------------------+------+------+
|Timestamp |Colour|Reason|
+-----------------------+------+------+
|2020-08-05 00:00:04,489|Blue |Sky |
|2020-10-05 00:00:04,489|Yellow|Flower|
+-----------------------+------+------+
Or you can use from_json:
import pyspark.sql.functions as F

df2 = df.select(
    F.col('col1').alias('Timestamp'),
    F.from_json('col2', 'Colour string, Reason string').alias('col2')
).select('Timestamp', 'col2.*')
df2.show(truncate=False)
+-----------------------+------+------+
|Timestamp |Colour|Reason|
+-----------------------+------+------+
|2020-08-05 00:00:04,489|Blue |Sky |
|2020-10-05 00:00:04,489|Yellow|Flower|
+-----------------------+------+------+

Related

getting null values when parsing json column in .csv file using from_json (using spark with scala ver 2.4)

I am getting null values when using from_json; can you help me figure out the missing piece here?
The input is a .csv file with JSON, e.g.:
id,request
1,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}
2,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}
My code (Scala/Spark):
val input_df = spark.read.option("header",true).option("escape","\"").csv(json_file_input)
val json_schema_abc = StructType(Array(
  StructField("Zipcode", IntegerType, true),
  StructField("ZipCodeType", StringType, true),
  StructField("City", StringType, true),
  StructField("State", StringType, true))
)
val output_df = input_df.select($"id", from_json(col("request"), json_schema_abc).as("json_request"))
  .select("id", "json_request.*")
Your problem is that the commas in your JSON column are being treated as delimiters. If you have a look at the contents of your input_df:
val input_df = spark.read.option("header",true).option("escape","\"").csv(json_file_input)
input_df.show(false)
+---+--------------+
|id |request |
+---+--------------+
|1 |{"Zipcode":704|
|2 |{"Zipcode":704|
+---+--------------+
You can see that the request column is not complete: it was chopped off at its first comma.
The rest of your code is correct; you can test it like this:
val input_df = Seq(
  (1, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""),
  (2, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}""")
).toDF("id", "request")

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
val json_schema_abc = StructType(Array(
  StructField("Zipcode", IntegerType, true),
  StructField("ZipCodeType", StringType, true),
  StructField("City", StringType, true),
  StructField("State", StringType, true))
)
val output_df = input_df
  .select($"id", from_json(col("request"), json_schema_abc).as("json_request"))
  .select("id", "json_request.*")
output_df.show(false)
+---+-------+-----------+-------------------+-----+
|id |Zipcode|ZipCodeType|City |State|
+---+-------+-----------+-------------------+-----+
|1 |704 |STANDARD |PARC PARQUE |PR |
|2 |704 |STANDARD |PASEO COSTA DEL SUR|PR |
+---+-------+-----------+-------------------+-----+
I would suggest changing your CSV file's delimiter (for example to ';', if that character does not appear in your data); that way the commas won't interfere with the JSON column.
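For example, a minimal sketch of what the read might look like if the file were re-exported with ';' as the field separator (the path and the other options are carried over from the question; the delimiter choice itself is an assumption):
// assumes the source file has been re-written with ';' as the field separator
val input_df = spark.read
  .option("header", true)
  .option("delimiter", ";") // "sep" works as well
  .option("escape", "\"")
  .csv(json_file_input)
input_df.show(false)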

How do you parse an array of JSON objects into a Spark Dataframe?

I have a collection of JSON files containing Twitter data that I'd like to use as a datasource for structured streaming in Databricks/Spark. The JSON files have the following structure:
[{...tweet data...},{...tweet data...},{...tweet data...},...]
My PySpark code:
# Stream from the /tmp/tweets folder
tweetstore = "/tmp/tweets/"

# Set up the folder as a streaming source
streamingInputDF = (
    spark
    .readStream
    .schema(json_schema)
    .json(tweetstore)
)

# Check
streamingInputDF.isStreaming

# Access the DF using SQL
streamingQuery = streamingInputDF \
    .select("run_stamp", "user", "id", "source", "favorite_count", "retweet_count") \
    .writeStream \
    .format("memory") \
    .queryName("tweetstream") \
    .outputMode("append") \
    .start()
streamingDF = spark.sql("select * from tweetstream order by 1 desc")
My output looks like this:
Number of entries in dataframe: 3875046
+---------+----+----+------+--------------+-------------+
|run_stamp|user|id |source|favorite_count|retweet_count|
+---------+----+----+------+--------------+-------------+
|null |null|null|null |null |null |
|null |null|null|null |null |null |
|null |null|null|null |null |null |
From what I can tell, I probably need to use a UDF or explode() to parse the JSON array properly, but I haven't quite figured out how so far.
It's working well for me on sample data:
val data = """[{"id":1,"name":"abc1"},{"id":2,"name":"abc2"},{"id":3,"name":"abc3"}]"""
val df = spark.read.json(Seq(data).toDS())
df.show(false)
df.printSchema()
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*
* root
* |-- id: long (nullable = true)
* |-- name: string (nullable = true)
*/
Documenting the answer for others who might stumble upon this: I realised the JSON didn't have one object per line as Spark expects. The key, then, was to add .option("multiline", True), i.e.:
streamingInputDF = (
    spark
    .readStream
    .option("multiline", True)
    .schema(json_schema)
    .json(tweetstore)
)

Timestamp format getting converted when DataFrame.toJSON is done in Spark Scala

I have a timestamp column in a Hive table that is read into a DataFrame using Spark SQL.
Once I have the DataFrame, I convert it to a JSON string using the toJSON function in Spark.
But the timestamp format is changed after applying toJSON to the DataFrame.
Code and output are as follows.
scala> newDF.show(false)
+--------------------------+--------------------------+
|current_ts |new_ ts |
+--------------------------+--------------------------+
|2019-04-10 01:00:27.551022|2019-04-10 06:00:27.551022|
|2019-04-10 01:00:49.07757 |2019-04-10 06:00:49.07757 |
scala> newDF.toJSON.show(false)
+-------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------+
|{" current_ts ":"2019-04-10T01:00:27.551-05:00","new_ ts":"2019-04-10T06:00:27.551-05:00"}|
|{" current_ts ":"2019-04-10T01:00:49.077-05:00","new_ ts":"2019-04-10T06:00:49.077-05:00"}|
The above output is not acceptable; we need the timestamp as it is displayed in the DataFrame, without casting it to the String data type.
The output I need is as follows:
+-------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------+
|{" current_ts ":"2019-04-10T01:00:27.551022","new_ ts":"2019-04-10T06:00:27.551022"}|
|{" current_ts ":"2019-04-10T01:00:49.07757","new_ ts":"2019-04-10T06:00:49.07757"}|
I am getting the expected output. Please see below:
scala> val df = Seq(
     |   ("2019-04-10 01:00:27.551022", "2019-04-10 06:00:27.551022"),
     |   ("2019-04-10 01:00:49.07757", "2019-04-10 06:00:49.07757")
     | ).toDF("current_ts", "new_ts")
Output
scala> df.toJSON.show(false)
+---------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------+
|{"current_ts":"2019-04-10 01:00:27.551022","new_ts":"2019-04-10 06:00:27.551022"}|
|{"current_ts":"2019-04-10 01:00:49.07757","new_ts":"2019-04-10 06:00:49.07757"} |
+---------------------------------------------------------------------------------+
I am using Spark 2.4. Can you please specify your version as well?
Thanks.
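If the columns are genuine timestamp columns (as they would be when read from Hive) rather than strings, one option worth trying is to serialize each row with to_json and pass an explicit timestampFormat option. This is only a sketch: the column handling is generic, and whether the microsecond pattern is honoured depends on the Spark version.
import org.apache.spark.sql.functions.{col, struct, to_json}

// sketch: serialize each row to JSON with an explicit timestamp pattern;
// support for the fractional-second pattern ("SSSSSS") varies by Spark version
val jsonDF = newDF.select(
  to_json(
    struct(newDF.columns.map(col): _*),
    Map("timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSSSS")
  ).as("value")
)
jsonDF.show(false)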

Parsing JSON file and extracting keys and values using Spark

I'm new to Spark. I have tried to parse the JSON file below using Spark SQL, but it didn't work. Can someone please help me resolve this?
Input JSON:
[{"num":"1234","Projections":[{"Transactions":[{"14:45":0,"15:00":0}]}]}]
Expected output:
1234 14:45 0\n
1234 15:00 0
I have tried the code below, but it did not work:
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("hdfs:/user/aswin/test.json").toDF();
val sql_output = sqlContext.sql("SELECT num, Projections.Transactions FROM df group by Projections.TotalTransactions ")
sql_output.collect.foreach(println)
Output:
[01532,WrappedArray(WrappedArray([0,0]))]
Spark recognizes your {"14:45":0,"15:00":0} map as a struct, so probably the only way to read your data is to specify the schema manually:
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField('num', StringType()),
...     StructField('Projections', ArrayType(StructType([
...         StructField('Transactions', ArrayType(MapType(StringType(), IntegerType())))
...     ])))
... ])
Then you can register a temporary table and query it, using explode several times to get the results:
>>> sqlContext.read.json('sample.json', schema=schema).registerTempTable('df')
>>> sqlContext.sql("select num, explode(col) from (select explode(col.Transactions), num from (select explode(Projections), num from df))").show()
+----+-----+-----+
| num| key|value|
+----+-----+-----+
|1234|14:45| 0|
|1234|15:00| 0|
+----+-----+-----+

Why does reading csv file with empty values lead to IndexOutOfBoundException?

I have a CSV file with the following structure:
Name | Val1 | Val2 | Val3 | Val4 | Val5
John 1 2
Joe 1 2
David 1 2 10 11
I am able to load this into an RDD fine. I tried to create a schema and then a DataFrame from it, and I get an IndexOutOfBound error.
Code is something like this ...
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
When I try to perform an action on rowRDD, it gives the error.
Any help is greatly appreciated.
This is not an answer to your question, but it may help you solve your problem.
From the question I see that you are trying to create a DataFrame from a CSV.
Creating a DataFrame from a CSV can easily be done using the spark-csv package.
With spark-csv, the Scala code below can be used to read a CSV:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also use inferSchema with the latest version. See this answer.
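For instance, a sketch of the same spark-csv read with type inference turned on:
// sketch: let spark-csv infer the column types instead of reading everything as strings
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csvFilePath)
df.printSchema()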
Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field delimited by its own commas):
David,1,2,10,,11
The problem is that your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
you try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
And Spark will take care of the rest.
A possible solution to that problem is replacing missing values with Double.NaN. Suppose I have a file example.csv with these columns in it:
David,1,2,10,,11
You can read the CSV file as a text file as follows:
val fileRDD = sc.textFile("example.csv").map { x =>
  val y = x.split(",")
  // keep the Name field as-is and replace empty numeric fields with Double.NaN
  y.head +: y.tail.map(k => if (k == "") Double.NaN else k.toDouble)
}
Then you can use your code to create a DataFrame from it.
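A rough sketch of that last step, with the schema assumed from the question's header (Name plus Val1..Val5):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// assumed schema: Name as a string, Val1..Val5 as doubles
val schema = StructType(
  StructField("Name", StringType, true) +:
    (1 to 5).map(i => StructField(s"Val$i", DoubleType, true))
)
// note: every line must yield 6 fields; use split(",", -1) if trailing fields can be empty
val rowRDD = fileRDD.map(fields => Row.fromSeq(fields))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.show()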
You can do it as follows.
val rowRDD = sc
  .textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p =>
    // 6 columns: Name plus Val1..Val5
    Row(
      p(0),
      p(1),
      p(2),
      p(3),
      p(4),
      p(5)))
Split using the delimiter of your file. When you set -1 as the limit, it keeps all the empty fields.