Spark Row to JSON

I would like to create JSON from a Spark 1.6 (Scala) dataframe. I know there is the simple solution of doing df.toJSON.
However, my problem looks a bit different. Consider for instance a dataframe with the following columns:
| A | B      | C1 | C2 | C3    |
|---|--------|----|----|-------|
| 1 | test   | ab | 22 | TRUE  |
| 2 | mytest | gh | 17 | FALSE |
I would like to have at the end a dataframe with
| A | B      | C                                        |
|---|--------|------------------------------------------|
| 1 | test   | { "c1" : "ab", "c2" : 22, "c3" : TRUE }  |
| 2 | mytest | { "c1" : "gh", "c2" : 17, "c3" : FALSE } |
where C is a JSON containing C1, C2, and C3. Unfortunately, at compile time I do not know what the dataframe looks like (except for the columns A and B, which are always "fixed").
As for the reason why I need this: I am using Protobuf for sending around the results. Unfortunately, my dataframe sometimes has more columns than expected, and I would still like to send those via Protobuf without specifying every column in the definition.
How can I achieve this?

Spark 2.1 should have native support for this use case (see #15354).
import org.apache.spark.sql.functions.{struct, to_json}

df.select(to_json(struct($"c1", $"c2", $"c3")))
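Since the extra columns are not known at compile time, a minimal sketch (assuming Spark 2.1+ and that only A and B are fixed) could build the struct dynamically from df.columns:
import org.apache.spark.sql.functions.{col, struct, to_json}

// keep the fixed columns A and B and pack every remaining column into a JSON column C
val fixed = Set("A", "B")
val dynamicCols = df.columns.filterNot(fixed.contains).map(col)
val result = df.select(col("A"), col("B"), to_json(struct(dynamicCols: _*)).as("C"))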

I use this to solve the same to_json problem:
val output_df = df.select(to_json(struct(col("*"))).alias("content"))
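Note that col("*") packs every column, including A and B, into the JSON; to keep A and B as separate top-level columns, select them explicitly alongside to_json, as in the sketch above.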

Here, no JSON parser is needed, and it adapts to your schema:
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

df.select(
  col(df.columns(0)),
  col(df.columns(1)),
  concat(
    lit("{"),
    concat_ws(",", df.dtypes.slice(2, df.dtypes.length).map(dt => {
      val c = dt._1 // column name
      val t = dt._2 // column type, e.g. "StringType"
      concat(
        lit("\"" + c + "\":" + (if (t == "StringType") "\"" else "")),
        col(c),
        lit(if (t == "StringType") "\"" else "")
      )
    }): _*),
    lit("}")
  ) as "C"
).collect()
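Note that the trailing .collect() materializes the result on the driver as an Array[Row]; drop it if you only need a DataFrame with the first two columns plus C.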

First, let's convert the C's to a struct:
val dfStruct = df.select($"A", $"B", struct($"C1", $"C2", $"C3").alias("C"))
This structure can then be converted to JSON lines using toJSON as before:
dfStruct.toJSON.collect
// Array[String] = Array(
// {"A":1,"B":"test","C":{"C1":"ab","C2":22,"C3":true}},
// {"A":2,"B":"mytest","C":{"C1":"gh","C2":17,"C3":false}})
I am not aware of any built-in method that can convert a single column, but you can either convert it individually and join, or use your favorite JSON parser in a UDF.
case class C(C1: String, C2: Int, C3: Boolean)

object CJsonizer {
  import org.json4s._
  import org.json4s.JsonDSL._
  import org.json4s.jackson.Serialization
  import org.json4s.jackson.Serialization.write

  implicit val formats = Serialization.formats(org.json4s.NoTypeHints)

  def toJSON(c1: String, c2: Int, c3: Boolean) = write(C(c1, c2, c3))
}

import org.apache.spark.sql.functions.udf

val cToJSON = udf((c1: String, c2: Int, c3: Boolean) =>
  CJsonizer.toJSON(c1, c2, c3))

df.withColumn("c_json", cToJSON($"C1", $"C2", $"C3"))

Related

getting null values when parsing json column in .csv file using from_json (using spark with scala ver 2.4)

Hi all, I am getting null values when using from_json; can you help me figure out the missing piece here?
The input is a .csv file containing JSON, e.g.
id,request
1,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}
2,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}
My code (Scala/Spark):
val input_df = spark.read.option("header", true).option("escape", "\"").csv(json_file_input)

val json_schema_abc = StructType(Array(
  StructField("Zipcode", IntegerType, true),
  StructField("ZipCodeType", StringType, true),
  StructField("City", StringType, true),
  StructField("State", StringType, true)))

val output_df = input_df
  .select($"id", from_json(col("request"), json_schema_abc).as("json_request"))
  .select("id", "json_request.*")
Your problem is that the commas inside your JSON column are being used as delimiters. If you have a look at the contents of your input_df:
val input_df = spark.read.option("header",true).option("escape","\"").csv(json_file_input)
input_df.show(false)
+---+--------------+
|id |request |
+---+--------------+
|1 |{"Zipcode":704|
|2 |{"Zipcode":704|
+---+--------------+
You see that the request column is not complete: it was chopped off at the first comma in the request column.
The rest of your code is correct, you can test it like this:
val input_df = Seq(
  (1, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""),
  (2, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}""")
).toDF("id", "request")

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val json_schema_abc = StructType(Array(
  StructField("Zipcode", IntegerType, true),
  StructField("ZipCodeType", StringType, true),
  StructField("City", StringType, true),
  StructField("State", StringType, true)))

val output_df = input_df
  .select($"id", from_json(col("request"), json_schema_abc).as("json_request"))
  .select("id", "json_request.*")
output_df.show(false)
+---+-------+-----------+-------------------+-----+
|id |Zipcode|ZipCodeType|City |State|
+---+-------+-----------+-------------------+-----+
|1 |704 |STANDARD |PARC PARQUE |PR |
|2 |704 |STANDARD |PASEO COSTA DEL SUR|PR |
+---+-------+-----------+-------------------+-----+
I would suggest changing your CSV file's delimiter (for example to ';', if that character does not appear in your data); that way the commas won't interfere.
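For reference, a small sketch of what reading the file could look like, assuming it has been re-written with ';' as the field separator:
val input_df = spark.read
  .option("header", true)
  .option("delimiter", ";")
  .option("escape", "\"")
  .csv(json_file_input)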

Writing out spark dataframe as nested JSON doc

I have a spark dataframe as:
| A    | B     | val_of_B | val1 | val2 | val3 | val4 |
|------|-------|----------|------|------|------|------|
| "c1" | "MCC" | "cd1"    | 1    | 2    | 1.1  | 1.05 |
| "c1" | "MCC" | "cd2"    | 2    | 3    | 1.1  | 1.05 |
| "c1" | "MCC" | "cd3"    | 3    | 4    | 1.1  | 1.05 |
val1 and val2 are obtained with a group by of A, B, and val_of_B, whereas val3 and val4 are A-level information only (for example, for the distinct A value "c1", val3 is always 1.1).
I would like to write this out as nested JSON; for each A, the JSON should look like:
{"val3": 1.1, "val4": 1.05, "MCC":[["cd1",1,2], ["cd2",2,3], ["cd3",3,4]]}
Is it possible to accomplish this with existing tools under spark api? If not, can you provide guidelines?
You should groupBy on column A and aggregate the necessary columns using the built-in first, collect_list, and array functions:
import org.apache.spark.sql.functions._

def zipping = udf((arr1: Seq[String], arr2: Seq[Seq[String]]) =>
  arr1.indices.map(index => Array(arr1(index)) ++ arr2(index)))

val jsonDF = df.groupBy("A")
  .agg(first(col("val3")).as("val3"), first(col("val4")).as("val4"), first(col("B")).as("B"),
    collect_list("val_of_B").as("val_of_B"), collect_list(array("val1", "val2")).as("list"))
  .select(col("val3"), col("val4"), col("B"), zipping(col("val_of_B"), col("list")).as("list"))
  .toJSON
which should give you
+-----------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","B":"MCC","list":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+-----------------------------------------------------------------------------------------------+
Next, replace the "list" key with the value of column B, using a udf function:
def exchangeName = udf((json: String) => {
  val splitted = json.split(",")
  val name = splitted(2).split(":")(1).trim  // the value of "B", e.g. "MCC"
  val value = splitted(3).split(":")(1).trim // the start of the "list" value
  splitted(0).trim + "," + splitted(1).trim + "," + name + ":" + value + "," +
    (4 until splitted.size).map(splitted(_)).mkString(",")
})
jsonDF.select(exchangeName(col("value")).as("json"))
.show(false)
which should give you your desired output
+------------------------------------------------------------------------------------+
|json |
+------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","MCC":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+------------------------------------------------------------------------------------+
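Since the goal is to write the dataframe out as nested JSON documents, a small follow-up sketch (the output path is hypothetical) could persist the strings as text files, one document per line:
jsonDF.select(exchangeName(col("value")).as("json"))
  .write
  .text("/tmp/nested_json_output")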

spark df.write quote all fields but not null values

I am trying to create a CSV from values stored in a table:
| col1  | col2  | col3  |
|-------|-------|-------|
| "one" | null  | "one" |
| "two" | "two" | "two" |
hive > select * from table where col2 is null;
one null one
I am getting the csv using the below code:
df.repartition(1)
  .write.option("header", true)
  .option("delimiter", ",")
  .option("quoteAll", true)
  .option("nullValue", "")
  .csv(S3Destination)
The CSV I get:
"col1","col2","col3"
"one","","one"
"two","two","two"
Expected CSV (with no double quotes for the null value):
"col1","col2","col3"
"one",,"one"
"two","two","two"
Any help is appreciated to know if the dataframe writer has options to do this.
You can take a UDF approach and apply it to the column (using withColumn on the repartitioned dataframe above) where a quoted empty string is possible; see the sample code below.
sqlContext.udf().register("convertToEmptyWithOutQuotes", (String abc) -> (abc.trim().length() > 0 ? abc : abc.replace("\"", " ")), DataTypes.StringType);
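The registration above uses Java syntax; a rough Scala sketch of the same idea (nulls passed through unchanged) might look like:
sqlContext.udf.register("convertToEmptyWithOutQuotes", (abc: String) =>
  if (abc == null) abc
  else if (abc.trim.length > 0) abc
  else abc.replace("\"", " "))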
String's replace method does the job:
val a = Array("'x'","","z")
println(a.mkString(",").replace("\"", " "))
will produce 'x',,z

Why does reading csv file with empty values lead to IndexOutOfBoundException?

I have a CSV file with the following structure:
| Name  | Val1 | Val2 | Val3 | Val4 | Val5 |
|-------|------|------|------|------|------|
| John  | 1    | 2    |      |      |      |
| Joe   | 1    | 2    |      |      |      |
| David | 1    | 2    | 10   | 11   |      |
I am able to load this into an RDD fine. When I try to create a schema and then a DataFrame from it, I get an IndexOutOfBounds error.
Code is something like this ...
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
When I tried to perform an action on rowRDD, it gave the error.
Any help is greatly appreciated.
This is not answer to your question. But it may help to solve your problem.
From the question I see that you are trying to create a dataframe from a CSV.
Creating a dataframe from a CSV can be done easily with the spark-csv package.
With spark-csv, the Scala code below can be used to read a CSV file:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also use inferSchema with the latest version; see this answer.
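For reference, a small sketch of enabling schema inference with spark-csv (assuming your spark-csv version supports the inferSchema option):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csvFilePath)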
Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field separated with its own commas):
David,1,2,10,,11
The problem is your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
You try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
And Spark will take care of the rest.
A possible solution to that problem is replacing missing values with Double.NaN. Suppose I have a file example.csv with these columns in it:
David,1,2,10,,11
You can read the CSV file as a text file as follows:
val fileRDD = sc.textFile("example.csv").map { x =>
  val y = x.split(",")
  y.map(k => if (k == "") Double.NaN else k.toDouble) // note: this assumes every field is numeric
}
And then you can use your code to create dataframe from it
You can do it as follows:
val rowRDD = sc.textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p =>
    Row(
      p(0),
      p(1),
      p(2),
      p(3),
      p(4),
      p(5)))
Split using your file's delimiter. When you set -1 as the limit, it keeps all the empty fields.

Summarizing/aggregating a Scala Slick object into another

I'm essentially trying to recreate the following SQL query using Scala Slick:
select labelOne, labelTwo, sum(countA), sum(countB) from things where date > 'blah' group by labelOne, labelTwo;
As you can see, it takes a table of labeled things and aggregates them, summing various counts. A table with the following info:
ID | date | labelOne | labelTwo | countA | countB
-------------------------------------------------
0 | 0 | foo | cheese | 1 | 2
1 | 0 | bar | wine | 0 | 3
2 | 1 | foo | cheese | 3 | 4
3 | 1 | bar | wine | 2 | 1
4 | 2 | foo | beer | 1 | 1
Should yield the following result if queried across all dates:
labelOne | labelTwo | countA | countB
-------------------------------------
foo | cheese | 4 | 6
bar | wine | 2 | 4
foo | beer | 1 | 1
This is what my Scala code looks like:
import scala.slick.driver.MySQLDriver.simple._
import scala.slick.jdbc.StaticQuery
import StaticQuery.interpolation
import org.joda.time.LocalDate
import com.github.tototoshi.slick.JodaSupport._
case class Thing(
  id: Option[Long],
  date: LocalDate,
  labelOne: String,
  labelTwo: String,
  countA: Long,
  countB: Long)

// summarized version of "Thing": note there's no date in this object
// each distinct grouping of Thing.labelOne + Thing.labelTwo should become a "SummarizedThing", with summed counts
case class SummarizedThing(
  labelOne: String,
  labelTwo: String,
  countASum: Long,
  countBSum: Long)

trait ThingsComponent {
  val Things: Things

  class Things extends Table[Thing]("things") {
    def id = column[Long]("id", O.PrimaryKey, O.AutoInc)
    def date = column[LocalDate]("date", O.NotNull)
    def labelOne = column[String]("labelOne", O.NotNull)
    def labelTwo = column[String]("labelTwo", O.NotNull)
    def countA = column[Long]("countA", O.NotNull)
    def countB = column[Long]("countB", O.NotNull)
    def * = id.? ~ date ~ labelOne ~ labelTwo ~ countA ~ countB <> (Thing.apply _, Thing.unapply _)
    val byId = createFinderBy(_.id)
  }
}

object Things extends DAO {
  def insert(thing: Thing)(implicit s: Session) { Things.insert(thing) }

  def findById(id: Long)(implicit s: Session): Option[Thing] = Things.byId(id).firstOption

  // ???
  def summarizeSince(date: LocalDate)(implicit s: Session): Set[SummarizedThing] = {
    Query(Things).where(_.date > date).groupBy(x => (x.labelOne, x.labelTwo)).map {
      case (thing: Thing) => {
        // obviously this line below is wrong, but you can get an idea of what I'm trying to accomplish:
        // create a new SummarizedThing for each unique labelOne + labelTwo combo, summing the count columns
        new SummarizedThing(thing.labelOne, thing.labelTwo, thing.countA.sum, thing.countB.sum)
      }
    } // presumably need to run the query and map to SummarizedThing here, perhaps?
  }
}
The summarizeSince function is where I'm having trouble. I seem to be able to query Things just fine, filtering by date, and grouping by my fields... however, I'm having trouble summing countA and countB. With the summed results, I'd then like to create a SummarizedThing for each unique labelOne + labelTwo combination. Hopefully that makes sense. Any help would be greatly appreciated.
presumably need to run the query and map to SummarizedThing here, perhaps?
Exactly.
Query(Things).filter(_.date > date).groupBy(x => (x.labelOne, x.labelTwo)).map {
  // match on (key, group)
  case ((labelOne, labelTwo), things) => {
    // prepare results as a tuple (note: .sum returns an Option)
    (labelOne, labelTwo, things.map(_.countA).sum.get, things.map(_.countB).sum.get)
  }
}.run.map(SummarizedThing.tupled) // run and map each tuple into the case class
Same as the other answer, but expressed as a for comprehension; note that .get can throw on an empty group, so you probably want getOrElse.
val q = for {
  ((l1, l2), ts) <- Things.where(_.date > date).groupBy(t => (t.labelOne, t.labelTwo))
} yield (l1, l2, ts.map(_.countA).sum.getOrElse(0L), ts.map(_.countB).sum.getOrElse(0L))

// see the SQL it generates
println(q.selectStatement)
// select x2.`labelOne`, x2.`labelTwo`, sum(x2.`countA`), sum(x2.`countB`)
// from `things` x2 where x2.`date` > '2013' group by x2.`labelOne`, x2.`labelTwo`

// map the result(s) of your query to your case class
q.list.map(SummarizedThing.tupled)
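Putting either answer back into the summarizeSince signature from the question, a rough sketch (not tested) might look like:
def summarizeSince(date: LocalDate)(implicit s: Session): Set[SummarizedThing] =
  Query(Things)
    .where(_.date > date)
    .groupBy(x => (x.labelOne, x.labelTwo))
    .map { case ((l1, l2), ts) =>
      (l1, l2, ts.map(_.countA).sum.getOrElse(0L), ts.map(_.countB).sum.getOrElse(0L))
    }
    .list
    .map(SummarizedThing.tupled)
    .toSet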