Writing out spark dataframe as nested JSON doc

I have a spark dataframe as:
A    | B     | val_of_B | val1 | val2 | val3 | val4
"c1" | "MCC" | "cd1"    | 1    | 2    | 1.1  | 1.05
"c1" | "MCC" | "cd2"    | 2    | 3    | 1.1  | 1.05
"c1" | "MCC" | "cd3"    | 3    | 4    | 1.1  | 1.05
val1 and val2 are obtained from a group by of A, B and val_of_B, whereas val3 and val4 are A-level information only (for example, taking distinct values of (A, val3) gives just ("c1", 1.1)).
I would like to write this out as nested JSON, which should look like:
For each A, the JSON should look like:
{"val3": 1.1, "val4": 1.05, "MCC":[["cd1",1,2], ["cd2",2,3], ["cd3",3,4]]}
Is it possible to accomplish this with existing tools in the Spark API? If not, can you provide guidelines?

You should groupBy on column A and aggregate the necessary columns using the first, collect_list and array built-in functions:
import org.apache.spark.sql.functions._

// zip each val_of_B value with its corresponding [val1, val2] pair
def zipping = udf((arr1: Seq[String], arr2: Seq[Seq[String]]) =>
  arr1.indices.map(index => Array(arr1(index)) ++ arr2(index)))

val jsonDF = df.groupBy("A")
  .agg(first(col("val3")).as("val3"), first(col("val4")).as("val4"), first(col("B")).as("B"),
    collect_list("val_of_B").as("val_of_B"), collect_list(array("val1", "val2")).as("list"))
  .select(col("val3"), col("val4"), col("B"), zipping(col("val_of_B"), col("list")).as("list"))
  .toJSON
which should give you
+-----------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","B":"MCC","list":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+-----------------------------------------------------------------------------------------------+
The next step is to replace the "list" key with the value of B, using a udf function:
// replaces the "list" key with the value of column B; relies on the exact
// field order and formatting of the JSON string produced above
def exchangeName = udf((json: String) => {
  val splitted = json.split(",")
  val name = splitted(2).split(":")(1).trim   // value of B, e.g. "MCC"
  val value = splitted(3).split(":")(1).trim  // start of the zipped list
  splitted(0).trim + "," + splitted(1).trim + "," + name + ":" + value + "," +
    (4 until splitted.size).map(splitted(_)).mkString(",")
})
jsonDF.select(exchangeName(col("value")).as("json"))
  .show(false)
which should give you your desired output
+------------------------------------------------------------------------------------+
|json |
+------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","MCC":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+------------------------------------------------------------------------------------+

Related

Combining JSON and normal columns with PySpark

I have a flat file that mixes normal columns with JSON columns:
2020-08-05 00:00:04,489|{"Colour":"Blue", "Reason":"Sky","number":"1"}
2020-10-05 00:00:04,489|{"Colour":"Yellow", "Reason":"Flower","number":"2"}
I want to flatten it out like this using pyspark:
|Timestamp|Colour|Reason|
|--------|--------|--------|
|2020-08-05 00:00:04,489|Blue| Sky|
|2020-10-05 00:00:04,489|Yellow| Flower|
At the moment I can only figure out how to convert the JSON by using spark.read.json and Map, but how do you combine regular columns like the timestamp?
Let's reconstruct your data:
from pyspark.sql.types import StructType, StructField, StringType
import pyspark.sql.functions as F

data2 = [("2020-08-05 00:00:04,489", '{"Colour":"Blue", "Reason":"Sky","number":"1"}'),
         ("2020-10-05 00:00:04,489", '{"Colour":"Yellow", "Reason":"Flower","number":"2"}')]

schema = StructType([
    StructField("x", StringType(), True),
    StructField("y", StringType(), True)])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)
As per the documentation, we can use schema_of_json to parse a JSON string and infer its schema in DDL format:
schema = df.select(F.schema_of_json(df.select("y").first()[0])).first()[0]
df.withColumn("y", F.from_json("y", schema)).selectExpr("x", "y.*").show(truncate=False)
+-----------------------+------+------+------+
|x |Colour|Reason|number|
+-----------------------+------+------+------+
|2020-08-05 00:00:04,489|Blue |Sky |1 |
|2020-10-05 00:00:04,489|Yellow|Flower|2 |
+-----------------------+------+------+------+
You can use get_json_object. Assuming that the original columns are called col1 and col2, then you can do:
import pyspark.sql.functions as F

df2 = df.select(
    F.col('col1').alias('Timestamp'),
    F.get_json_object('col2', '$.Colour').alias('Colour'),
    F.get_json_object('col2', '$.Reason').alias('Reason')
)
df2.show(truncate=False)
+-----------------------+------+------+
|Timestamp |Colour|Reason|
+-----------------------+------+------+
|2020-08-05 00:00:04,489|Blue |Sky |
|2020-10-05 00:00:04,489|Yellow|Flower|
+-----------------------+------+------+
Or you can use from_json:
import pyspark.sql.functions as F

df2 = df.select(
    F.col('col1').alias('Timestamp'),
    F.from_json('col2', 'Colour string, Reason string').alias('col2')
).select('Timestamp', 'col2.*')
df2.show(truncate=False)
+-----------------------+------+------+
|Timestamp |Colour|Reason|
+-----------------------+------+------+
|2020-08-05 00:00:04,489|Blue |Sky |
|2020-10-05 00:00:04,489|Yellow|Flower|
+-----------------------+------+------+

Count occurrence of an element in PySpark DataFrame

I have a csv file marks.csv. I have read it using pyspark and created a data frame df.
It looks like this (the csv file):
sub1,sub2,sub3
a,a,b
b,b,a
c,a,b
How can I get the count of ‘a’ in each column in the data frame df?
Thanks.
As we can leverage SQL's features in Spark, we can simply do as below:
df.selectExpr("sum(if( sub1 = 'a' , 1, 0 )) as count1","sum(if( sub2 = 'a' , 1, 0 )) as count2","sum(if( sub3 = 'a' , 1, 0 )) as count3").show()
It should give output as below:
+------+------+------+
|count1|count2|count3|
+------+------+------+
| 1| 2| 1|
+------+------+------+
To learn more, see the Spark SQL documentation.
EDIT:
If you want to do it for all columns then you can try something like below:
from pyspark.sql.types import Row

final_out = spark.createDataFrame([Row()])  # create an empty dataframe
# just loop through all columns
for col_name in df.columns:
    final_out = final_out.crossJoin(
        df.selectExpr("sum(if(" + col_name + " = 'a', 1, 0)) as " + col_name))
final_out.show()
It should give you output like below:
+----+----+----+
|sub1|sub2|sub3|
+----+----+----+
| 1| 2| 1|
+----+----+----+
You can use a CASE WHEN style when/otherwise expression to get the count of "a" in each column:
import pyspark.sql.functions as F

df2 = df.select(
    F.sum(F.when(df["sub1"] == "a", 1).otherwise(0)).alias("sub1_cnt"),
    F.sum(F.when(df["sub2"] == "a", 1).otherwise(0)).alias("sub2_cnt"),
    F.sum(F.when(df["sub3"] == "a", 1).otherwise(0)).alias("sub3_cnt"))
df2.show()

Remove special characters from csv data using Spark

I want to remove specific special characters (e.g. # and &) from CSV data using PySpark. I have gone through optimuspyspark (https://github.com/ironmussa/Optimus). However, it removes all the special characters, and I want to remove only specific ones. Are there any built-in functions, custom functions, or third-party libraries to achieve this? Thanks in advance.
A few links I tried:
https://community.hortonworks.com/questions/49802/escaping-double-quotes-in-spark-dataframe.html
Hope this is what you are looking for:
Assume you have a simple CSV file (2 lines) that looks like this:
#A 234, 'B' 225, 'C' !556
#D 235, 'E' 2256, 'F'! 557
Read the CSV into a dataframe:
df = spark.read.csv('test1.csv', mode="DROPMALFORMED",
                    inferSchema=True,
                    header=False)
df.show()
+------+---------+---------+
| _c0| _c1| _c2|
+------+---------+---------+
|#A 234| 'B' 225| 'C' !556|
|#D 235| 'E' 2256| 'F'! 557|
+------+---------+---------+
Use pyspark functions to remove specific unwanted characters:
from pyspark.sql.functions import *

newDf = df.withColumn('_c0', regexp_replace('_c0', '#', '')) \
          .withColumn('_c1', regexp_replace('_c1', "'", '')) \
          .withColumn('_c2', regexp_replace('_c2', '!', ''))
newDf.show()
+-----+-------+--------+
| _c0| _c1| _c2|
+-----+-------+--------+
|A 234| B 225| 'C' 556|
|D 235| E 2256| 'F' 557|
+-----+-------+--------+
If you want to remove a specific character from ALL columns, try this:
Starting with the same simplified text file/dataframe as above:
+------+---------+---------+
| _c0| _c1| _c2|
+------+---------+---------+
|#A 234| 'B' 225| 'C' !556|
|#D 235| 'E' 2256| 'F'! 557|
+------+---------+---------+
Function to remove a character from a column in a dataframe:
def cleanColumn(tmpdf, colName, findChar, replaceChar):
    # replace every occurrence of findChar with replaceChar in the given column
    tmpdf = tmpdf.withColumn(colName, regexp_replace(colName, findChar, replaceChar))
    return tmpdf
remove the " ' " character from ALL columns in the df (replace with nothing i.e. "")
allColNames = df.schema.names
charToRemove = "'"
replaceWith = ""
for colName in allColNames:
    df = cleanColumn(df, colName, charToRemove, replaceWith)
The resultant output is:
df.show()
+------+-------+-------+
| _c0| _c1| _c2|
+------+-------+-------+
|#A 234| B 225| C !556|
|#D 235| E 2256| F! 557|
+------+-------+-------+
With Optimus you can:
df.cols.replace("*",["a","b","c"]," ").table()
Use the star ("*") to select all the columns.
Pass an array with the elements you want to match.
The character that will replace the match: whitespace in this case.

Spark Row to JSON

I would like to create a JSON from a Spark v1.6 dataframe (using Scala). I know that there is the simple solution of doing df.toJSON.
However, my problem looks a bit different. Consider for instance a dataframe with the following columns:
| A | B | C1 |  C2 | C3 |
-------------------------------------------
| 1 | test | ab | 22 | TRUE |
| 2 | mytest | gh | 17 | FALSE |
I would like to have at the end a dataframe with
| A | B | C |
----------------------------------------------------------------
| 1 | test | { "c1" : "ab", "c2" : 22, "c3" : TRUE } |
| 2 | mytest | { "c1" : "gh", "c2" : 17, "c3" : FALSE } |
where C is a JSON containing C1, C2, C3. Unfortunately, at compile time I do not know what the dataframe looks like (except for the columns A and B, which are always "fixed").
As for the reason why I need this: I am using Protobuf for sending around the results. Unfortunately, my dataframe sometimes has more columns than expected and I would still send those via Protobuf, but I do not want to specify all columns in the definition.
How can I achieve this?
Spark 2.1 should have native support for this use case (see #15354).
import org.apache.spark.sql.functions.to_json
df.select(to_json(struct($"c1", $"c2", $"c3")))
I use this command to solve the to_json problem:
output_df = (df.select(to_json(struct(col("*"))).alias("content")))
Here is an approach with no JSON parser, and it adapts to your schema:
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

df.select(
  col(df.columns(0)),
  col(df.columns(1)),
  concat(
    lit("{"),
    concat_ws(",", df.dtypes.slice(2, df.dtypes.length).map(dt => {
      val c = dt._1
      val t = dt._2
      concat(
        lit("\"" + c + "\":" + (if (t == "StringType") "\"" else "")),
        col(c),
        lit(if (t == "StringType") "\"" else "")
      )
    }): _*),
    lit("}")
  ) as "C"
).collect()
First, let's convert the C's to a struct:
val dfStruct = df.select($"A", $"B", struct($"C1", $"C2", $"C3").alias("C"))
This structure can be converted to JSONL using toJSON as before:
dfStruct.toJSON.collect
// Array[String] = Array(
// {"A":1,"B":"test","C":{"C1":"ab","C2":22,"C3":true}},
// {"A":2,"B":"mytest","C":{"C1":"gh","C2":17,"C3":false}})
I am not aware of any built-in method that can convert a single column, but you can either convert it individually and join, or use your favorite JSON parser in a UDF.
case class C(C1: String, C2: Int, C3: Boolean)

object CJsonizer {
  import org.json4s._
  import org.json4s.JsonDSL._
  import org.json4s.jackson.Serialization
  import org.json4s.jackson.Serialization.write

  implicit val formats = Serialization.formats(org.json4s.NoTypeHints)

  def toJSON(c1: String, c2: Int, c3: Boolean) = write(C(c1, c2, c3))
}
val cToJSON = udf((c1: String, c2: Int, c3: Boolean) =>
CJsonizer.toJSON(c1, c2, c3))
df.withColumn("c_json", cToJSON($"C1", $"C2", $"C3"))

Why does reading a CSV file with empty values lead to an IndexOutOfBoundsException?

I have a CSV file with the following structure:
Name | Val1 | Val2 | Val3 | Val4 | Val5
John 1 2
Joe 1 2
David 1 2 10 11
I am able to load this into an RDD fine. When I try to create a schema and then a DataFrame from it, I get an IndexOutOfBounds error.
Code is something like this ...
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
When I try to perform an action on rowRDD, it gives the error.
Any help is greatly appreciated.
This is not an answer to your question, but it may help to solve your problem.
From the question I see that you are trying to create a dataframe from a CSV.
Creating a dataframe from a CSV can easily be done using the spark-csv package.
With spark-csv, the Scala code below can be used to read a CSV:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result:
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also use inferSchema with the latest version.
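For example, a minimal sketch with the same spark-csv reader as above (assuming the same csvFilePath):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")  // ask spark-csv to detect the column types
  .load(csvFilePath)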
Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field separated with its own commas):
David,1,2,10,,11
The problem is your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
You try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
And Spark will take care of the rest.
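For completeness, a minimal sketch of what "the rest" can look like: build the DataFrame from the corrected rowRDD with an explicit schema (a hypothetical all-string schema for the six columns, just for illustration):
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// hypothetical schema: read all six CSV columns as strings
val schema = StructType(Seq("Name", "Val1", "Val2", "Val3", "Val4", "Val5")
  .map(name => StructField(name, StringType, nullable = true)))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.show()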
A possible solution to that problem is replacing missing values with Double.NaN. Suppose I have a file example.csv with the following columns in it:
David,1,2,10,,11
You can read the CSV file as a text file as follows:
val fileRDD = sc.textFile("example.csv")
  .map(x => x.split(",").map(k => if (k == "") Double.NaN else k.toDouble))
And then you can use your code to create a dataframe from it.
You can do it as follows.
val rowRDD = sqlContext.read
  .textFile(csvFilePath)   // Dataset[String]; use sc.textFile(csvFilePath) on older Spark versions
  .rdd
  .map(_.split(delimiter_of_file, -1))
  .map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
Split using the delimiter of your file. When you set -1 as the limit, the split keeps all the empty fields.
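A minimal sketch of the difference, in plain Scala with a hypothetical row that has trailing empty fields:
val line = "John,1,2,,,"       // hypothetical row with three trailing empty fields

line.split(",").length         // 3 -- trailing empty strings are dropped
line.split(",", -1).length     // 6 -- all fields are kept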