Extract Options from potentially null JSON values using a for expression (json4s)

I have a JSON document where some values can be null. Using for expressions in json4s, how can I yield None instead of nothing?
The following fails to yield anything when the value of either FormattedID or PlanEstimate is null.
val j: json4s.JValue = ...
for {
  JObject(list) <- j
  JField("FormattedID", JString(id)) <- list
  JField("PlanEstimate", JDouble(points)) <- list
} yield (id, points)
For example:
import org.json4s._
import org.json4s.jackson.JsonMethods._
scala> parse("""{
| "FormattedID" : "the id",
| "PlanEstimate" : null
| }""")
res1: org.json4s.JValue = JObject(List((FormattedID,JString(the id)), (PlanEstimate,JNull)))
scala> for {
     |   JObject(thing) <- res1
     |   JField("FormattedID", JString(id)) <- thing
     | } yield id
res2: List[String] = List(the id)
scala> for {
     |   JObject(thing) <- res1
     |   JField("PlanEstimate", JDouble(points)) <- thing
     | } yield points
res3: List[Double] = List()
// Ideally res3 should be List[Option[Double]] = List(None)

scala> object OptExtractors {
     |   // Define a custom extractor for use in for comprehensions.
     |   // It returns Some[Option[Double]] instead of Option[Double],
     |   // so the pattern always matches and a non-double (e.g. JNull) becomes None.
     |   object JDoubleOpt {
     |     def unapply(e: Any) = e match {
     |       case d: JDouble => Some(JDouble.unapply(d))
     |       case _ => Some(None)
     |     }
     |   }
     | }
defined object OptExtractors
scala> val j = parse("""{
| "FormattedID" : "the id",
| "PlanEstimate" : null
| }""")
j: org.json4s.JValue = JObject(List((FormattedID,JString(the id)), (PlanEstimate,JNull)))
scala> import OptExtractors._
import OptExtractors._
scala> for {
     |   JObject(list) <- j
     |   JField("FormattedID", JString(id)) <- list
     |   JField("PlanEstimate", JDoubleOpt(points)) <- list
     | } yield (id, points)
res1: List[(String, Option[Double])] = List((the id,None))

According to the documentation,
Any value can be optional. Field and value is completely removed when
it doesn't have a value.
scala> val json = ("name" -> "joe") ~ ("age" -> (None: Option[Int]))
scala> compact(render(json))
res4: String = {"name":"joe"}
This explains why your for comprehension doesn't yield anything: a null value is mapped to None internally, and the field is simply dropped.

The last command should look like:
for {
  JObject(thing) <- res1
} yield thing.collectFirst { case JField("PlanEstimate", JDouble(points)) => points }
Or like:
for {
  JObject(thing) <- res1
  points = thing.collectFirst { case JField("PlanEstimate", JDouble(p)) => p }
} yield points

What about this?
for {
  JObject(thing) <- res1
  x = thing.find(_._1 == "PlanEstimate").flatMap(_._2.toOption.map(_.values))
} yield x
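A further sketch of mine, not from the answers above: json4s's JValue.toOption (already used in the snippet above) treats JNothing and JNull as None, so combined with the \ selector each field can be pulled out as an Option directly. Assuming j is the parsed value from earlier:
import org.json4s._

// Each selector yields Option[JValue]; collect narrows to the expected type.
val idOpt: Option[String] =
  (j \ "FormattedID").toOption.collect { case JString(s) => s }
val pointsOpt: Option[Double] =
  (j \ "PlanEstimate").toOption.collect { case JDouble(d) => d }
// idOpt == Some("the id"), pointsOpt == None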

Related

How to split an Array-of-JSON DataFrame column into multiple rows in Scala

How can I split an Array-of-JSON DataFrame column into multiple rows in Spark (Scala)?
Input DataFrame:
+----------+-------+-----------------------------------------------------------------------------------------------------------------------------+
|item_id   |s_tag  |jsonString                                                                                                                   |
+----------+-------+-----------------------------------------------------------------------------------------------------------------------------+
|Item_12345|S_12345|[{"First":{"Info":"ABCD123","Res":"5.2"}},{"Second":{"Info":"ABCD123","Res":"5.2"}},{"Third":{"Info":"ABCD123","Res":"5.2"}}]|
+----------+-------+-----------------------------------------------------------------------------------------------------------------------------+
Output DataFrame:
+----------+-------+-----------------------------------------+
|item_id   |s_tag  |jsonString                               |
+----------+-------+-----------------------------------------+
|Item_12345|S_12345|{"First":{"Info":"ABCD123","Res":"5.2"}} |
|Item_12345|S_12345|{"Second":{"Info":"ABCD123","Res":"5.2"}}|
|Item_12345|S_12345|{"Third":{"Info":"ABCD123","Res":"5.2"}} |
+----------+-------+-----------------------------------------+
This is what I have tried so far, but it did not work:
val rawDF = sparkSession
  .sql("select 1")
  .withColumn("item_id", lit("Item_12345"))
  .withColumn("s_tag", lit("S_12345"))
  .withColumn("jsonString", lit("""[{"First":{"Info":"ABCD123","Res":"5.2"}},{"Second":{"Info":"ABCD123","Res":"5.2"}},{"Third":{"Info":"ABCD123","Res":"5.2"}}]"""))
val newDF = rawDF.withColumn("splittedJson", explode(rawDF.col("jsonString")))
The issue in the example code you posted is that the JSON is represented as a string and hence cannot be exploded. Try something like this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{typedLit, _}

object tmp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    val arr = Seq(
      "{\"First\":{\"Info\":\"ABCD123\",\"Res\":\"5.2\"}}",
      "{\"Second\":{\"Info\":\"ABCD123\",\"Res\":\"5.2\"}}",
      "{\"Third\":{\"Info\":\"ABCD123\",\"Res\":\"5.2\"}}")
    val rawDF = spark.sql("select 1")
      .withColumn("item_id", lit("Item_12345"))
      .withColumn("s_tag", lit("S_12345"))
      .withColumn("jsonString", typedLit(arr))
    val newDF = rawDF.withColumn("splittedJson", explode(rawDF.col("jsonString")))
    newDF.show()
  }
}
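Alternatively, if the column really does arrive as a single JSON string (as in the original attempt), here is a hedged sketch of mine: parse it with from_json and then explode. It assumes Spark 2.4+ (for to_json over map columns) and a schema inferred from the sample payload:
import org.apache.spark.sql.functions.{col, explode, from_json, to_json}
import org.apache.spark.sql.types._

// Assumed shape: an array of single-key objects, each wrapping {Info, Res}
val innerSchema = StructType(Seq(
  StructField("Info", StringType),
  StructField("Res", StringType)))
val schema = ArrayType(MapType(StringType, innerSchema))

val exploded = rawDF
  .withColumn("parsed", from_json(col("jsonString"), schema)) // string -> array column
  .withColumn("item", explode(col("parsed")))                 // one row per element
  .withColumn("splittedJson", to_json(col("item")))           // element back to a JSON string
  .drop("parsed", "item")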

Spark Row to JSON

I would like to create a JSON from a Spark 1.6 (Scala) dataframe. I know that there is the simple solution of doing df.toJSON.
However, my problem looks a bit different. Consider for instance a dataframe with the following columns:
| A | B      | C1 | C2 | C3    |
--------------------------------
| 1 | test   | ab | 22 | TRUE  |
| 2 | mytest | gh | 17 | FALSE |
I would like to have at the end a dataframe with
| A | B      | C                                        |
----------------------------------------------------------
| 1 | test   | { "c1" : "ab", "c2" : 22, "c3" : TRUE }  |
| 2 | mytest | { "c1" : "gh", "c2" : 17, "c3" : FALSE } |
where C is a JSON containing C1, C2, C3. Unfortunately, at compile time I do not know what the dataframe looks like (except for the columns A and B, which are always "fixed").
As for the reason why I need this: I am using Protobuf for sending around the results. Unfortunately, my dataframe sometimes has more columns than expected and I would still send those via Protobuf, but I do not want to specify all columns in the definition.
How can I achieve this?
Spark 2.1 should have native support for this use case (see #15354).
import org.apache.spark.sql.functions.{struct, to_json}
df.select(to_json(struct($"C1", $"C2", $"C3")))
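A small follow-on sketch of mine (not part of the original answer), building the full target frame, assuming Spark 2.1+ and that the $ column implicits are in scope:
// Keep the fixed columns and name the packed JSON column "C"
val result = df.select($"A", $"B", to_json(struct($"C1", $"C2", $"C3")).alias("C"))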
I use this command to solve the to_json problem:
val outputDF = df.select(to_json(struct(col("*"))).alias("content"))
Here, no JSON parser is needed, and it adapts to your schema:
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

df.select(
  col(df.columns(0)),
  col(df.columns(1)),
  concat(
    lit("{"),
    concat_ws(",", df.dtypes.slice(2, df.dtypes.length).map { dt =>
      val c = dt._1
      val t = dt._2
      concat(
        lit("\"" + c + "\":" + (if (t == "StringType") "\"" else "")),
        col(c),
        lit(if (t == "StringType") "\"" else "")
      )
    }: _*),
    lit("}")
  ) as "C"
).collect()
First let's convert the C columns to a struct:
val dfStruct = df.select($"A", $"B", struct($"C1", $"C2", $"C3").alias("C"))
This structure can be converted to JSONL using toJSON as before:
dfStruct.toJSON.collect
// Array[String] = Array(
// {"A":1,"B":"test","C":{"C1":"ab","C2":22,"C3":true}},
// {"A":2,"B":"mytest","C":{"C1":"gh","C2":17,"C3":false}})
I am not aware of any built-in method that can convert a single column, but you can either convert it individually and join, or use your favorite JSON parser in a UDF.
case class C(C1: String, C2: Int, C3: Boolean)

object CJsonizer {
  import org.json4s._
  import org.json4s.JsonDSL._
  import org.json4s.jackson.Serialization
  import org.json4s.jackson.Serialization.write

  implicit val formats = Serialization.formats(org.json4s.NoTypeHints)

  def toJSON(c1: String, c2: Int, c3: Boolean) = write(C(c1, c2, c3))
}

val cToJSON = udf((c1: String, c2: Int, c3: Boolean) =>
  CJsonizer.toJSON(c1, c2, c3))

df.withColumn("c_json", cToJSON($"C1", $"C2", $"C3"))
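Since the extra columns aren't known at compile time, here is a hedged sketch of mine (assuming Spark 2.1+ for to_json) that builds the struct dynamically from whatever columns follow A and B:
import org.apache.spark.sql.functions.{col, struct, to_json}

// All columns after the fixed A and B, discovered at runtime
val extraCols = df.columns.drop(2).map(col)
val result = df.select(col("A"), col("B"), to_json(struct(extraCols: _*)).alias("C"))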

Summarizing/aggregating a Scala Slick object into another

I'm essentially trying to recreate the following SQL query using Scala Slick:
select labelOne, labelTwo, sum(countA), sum(countB) from things where date > 'blah' group by labelOne, labelTwo;
As you can see, it takes a table of labeled things and aggregates them, summing various counts. Given a table with the following info:
ID | date | labelOne | labelTwo | countA | countB
-------------------------------------------------
0 | 0 | foo | cheese | 1 | 2
1 | 0 | bar | wine | 0 | 3
2 | 1 | foo | cheese | 3 | 4
3 | 1 | bar | wine | 2 | 1
4 | 2 | foo | beer | 1 | 1
Should yield the following result if queried across all dates:
labelOne | labelTwo | countA | countB
-------------------------------------
foo | cheese | 4 | 6
bar | wine | 2 | 4
foo | beer | 1 | 1
This is what my Scala code looks like:
import scala.slick.driver.MySQLDriver.simple._
import scala.slick.jdbc.StaticQuery
import StaticQuery.interpolation
import org.joda.time.LocalDate
import com.github.tototoshi.slick.JodaSupport._

case class Thing(
  id: Option[Long],
  date: LocalDate,
  labelOne: String,
  labelTwo: String,
  countA: Long,
  countB: Long)

// summarized version of "Thing": note there's no date in this object;
// each distinct grouping of Thing.labelOne + Thing.labelTwo should become
// a "SummarizedThing", with summed counts
case class SummarizedThing(
  labelOne: String,
  labelTwo: String,
  countASum: Long,
  countBSum: Long)

trait ThingsComponent {
  val Things: Things

  class Things extends Table[Thing]("things") {
    def id = column[Long]("id", O.PrimaryKey, O.AutoInc)
    def date = column[LocalDate]("date", O.NotNull)
    def labelOne = column[String]("labelOne", O.NotNull)
    def labelTwo = column[String]("labelTwo", O.NotNull)
    def countA = column[Long]("countA", O.NotNull)
    def countB = column[Long]("countB", O.NotNull)
    def * = id.? ~ date ~ labelOne ~ labelTwo ~ countA ~ countB <> (Thing.apply _, Thing.unapply _)
    val byId = createFinderBy(_.id)
  }
}

object Things extends DAO {
  def insert(thing: Thing)(implicit s: Session) { Things.insert(thing) }

  def findById(id: Long)(implicit s: Session): Option[Thing] = Things.byId(id).firstOption

  // ???
  def summarizeSince(date: LocalDate)(implicit s: Session): Set[SummarizedThing] = {
    Query(Things).where(_.date > date).groupBy(x => (x.labelOne, x.labelTwo)).map {
      case (thing: Thing) => {
        // obviously this line below is wrong, but you can get an idea of what I'm trying to accomplish:
        // create a new SummarizedThing for each unique labelOne + labelTwo combo, summing the count columns
        new SummarizedThing(thing.labelOne, thing.labelTwo, thing.countA.sum, thing.countB.sum)
      }
    } // presumably need to run the query and map to SummarizedThing here, perhaps?
  }
}
The summarizeSince function is where I'm having trouble. I seem to be able to query Things just fine, filtering by date, and grouping by my fields... however, I'm having trouble summing countA and countB. With the summed results, I'd then like to create a SummarizedThing for each unique labelOne + labelTwo combination. Hopefully that makes sense. Any help would be greatly appreciated.
presumably need to run the query and map to SummarizedThing here, perhaps?
Exactly.
Query(Things).filter(_.date > date).groupBy(x => (x.labelOne, x.labelTwo)).map {
  // match on (key, group)
  case ((labelOne, labelTwo), things) => {
    // prepare results as a tuple (note .sum returns an Option)
    (labelOne, labelTwo, things.map(_.countA).sum.get, things.map(_.countB).sum.get)
  }
}.run.map(SummarizedThing.tupled) // run and map tuple into case class
Same as the other answer, but expressed as a for comprehension; note that .get can throw, so you probably want getOrElse.
val q = for {
  ((l1, l2), ts) <- Things.where(_.date > date).groupBy(t => (t.labelOne, t.labelTwo))
} yield (l1, l2, ts.map(_.countA).sum.getOrElse(0L), ts.map(_.countB).sum.getOrElse(0L))
// see the SQL this generates
println( q.selectStatement )
// select x2.`labelOne`, x2.`labelTwo`, sum(x2.`countA`), sum(x2.`countB`)
// from `things` x2 where x2.`date` > '2013' group by x2.`labelOne`, x2.`labelTwo`
// map the result(s) of your query to your case class
q.list.map(SummarizedThing.tupled)
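Hypothetical usage under Slick 2.x conventions, assuming a configured Database value named db is in scope:
db.withSession { implicit session: Session =>
  val summaries: Seq[SummarizedThing] = q.list.map(SummarizedThing.tupled)
}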

Difference between these two method definitions

What's the difference between these two definitions?:
def sayTwords(word1: String, word2: String) = println(word1 + " " + word2)
def sayTwords2(word1: String)(word2: String) = println(word1 + " " + word2)
What is the purpose of each?
The second is curried, the first isn't. For a discussion of why you might choose to curry a method, see What's the rationale behind curried functions in Scala?
sayTwords2 allows the method to be partially applied:
val sayHelloAnd = sayTwords2("Hello") _
sayHelloAnd("World!")
sayHelloAnd("Universe!")
Note you can also use the first method in a similar way:
val sayHelloAnd = sayTwords("Hello", _: String)
sayHelloAnd("World!")
sayHelloAnd("Universe!")
def sayTwords(word1: String, word2: String) = println(word1 + " " + word2)
def sayTwords2(word1: String)(word2: String) = println(word1 + " " + word2)
The first contains a single parameter list. The second contains multiple parameter lists.
They differ in the following regards:
Partial application syntax. Observe:
scala> val f = sayTwords("hello", _: String)
f: String => Unit = <function1>
scala> f("world")
hello world
scala> val g = sayTwords2("hello") _
g: String => Unit = <function1>
scala> g("world")
hello world
The former has the benefit of positional syntax: you can partially apply arguments at any position, as shown below.
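For instance (a hypothetical continuation of the session above), fixing the second argument instead of the first:
scala> val h = sayTwords(_: String, "world")
h: String => Unit = <function1>

scala> h("goodbye")
goodbye world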
Type inference. Type inference in Scala works per parameter list, going from left to right, so one form can allow better inference than the other. Observe:
scala> import scala.util.control.Breaks._
import scala.util.control.Breaks._

scala> def unfold[A, B](seed: B, f: B => Option[(A, B)]): Seq[A] = {
     |   val s = Seq.newBuilder[A]
     |   var x = seed
     |   breakable {
     |     while (true) {
     |       f(x) match {
     |         case None => break
     |         case Some((r, x0)) => s += r; x = x0
     |       }
     |     }
     |   }
     |   s.result
     | }
unfold: [A, B](seed: B, f: B => Option[(A, B)])Seq[A]
scala> unfold(11, x => if (x == 0) None else Some((x, x - 1)))
<console>:18: error: missing parameter type
unfold(11, x => if (x == 0) None else Some((x, x - 1)))
^
scala> unfold(11, (x: Int) => if (x == 0) None else Some((x, x - 1)))
res7: Seq[Int] = List(11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
scala> def unfold[A, B](seed: B)(f: B => Option[(A, B)]): Seq[A] = {
     |   val s = Seq.newBuilder[A]
     |   var x = seed
     |   breakable {
     |     while (true) {
     |       f(x) match {
     |         case None => break
     |         case Some((r, x0)) => s += r; x = x0
     |       }
     |     }
     |   }
     |   s.result
     | }
unfold: [A, B](seed: B)(f: B => Option[(A, B)])Seq[A]
scala> unfold(11)(x => if (x == 0) None else Some((x, x - 1)))
res8: Seq[Int] = List(11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1)

How to return a function in scala

How can I return a side-effecting lexical closure1 in Scala?
For instance, I was looking at this code sample in Go:
...
// fib returns a function that returns
// successive Fibonacci numbers.
func fib() func() int {
    a, b := 0, 1
    return func() int {
        a, b = b, a+b
        return b
    }
}
...
println(f(), f(), f(), f(), f())
prints
1 2 3 5 8
And I can't figure out how to write the same in Scala.
1. Corrected after Apocalisp comment
Slightly shorter, you don't need the return.
def fib() = {
  var a = 0
  var b = 1
  () => {
    val t = a
    a = b
    b = t + b
    b
  }
}
Gah! Mutable variables?!
val fib: Stream[Int] =
1 #:: 1 #:: (fib zip fib.tail map Function.tupled(_+_))
You can return a literal function that gets the nth fib, for example:
val fibAt: Int => Int = fib drop _ head
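For instance (hypothetical REPL output):
scala> fibAt(4)
res0: Int = 5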
EDIT: Since you asked for the functional way of "getting a different value each time you call f", here's how you would do that. This uses Scalaz's State monad:
import scalaz._
import Scalaz._
def uncons[A](s: Stream[A]) = (s.tail, s.head)
val f = state(uncons[Int])
The value f is a state transition function. Given a stream, it will return its head, and "mutate" the stream on the side by taking its tail. Note that f is totally oblivious to fib. Here's a REPL session illustrating how this works:
scala> (for { _ <- f; _ <- f; _ <- f; _ <- f; x <- f } yield x)
res29: scalaz.State[scala.collection.immutable.Stream[Int],Int] = scalaz.States$$anon$1#d53513
scala> (for { _ <- f; _ <- f; _ <- f; x <- f } yield x)
res30: scalaz.State[scala.collection.immutable.Stream[Int],Int] = scalaz.States$$anon$1#1ad0ff8
scala> res29 ! fib
res31: Int = 5
scala> res30 ! fib
res32: Int = 3
Clearly, the value you get out depends on the number of times you call f. But this is all purely functional and therefore modular and composable. For example, we can pass any nonempty Stream, not just fib.
So you see, you can have effects without side-effects.
While we're sharing cool implementations of the fibonacci function that are only tangentially related to the question, here's a memoized version:
val fib: Int => BigInt = {
  def fibRec(f: Int => BigInt)(n: Int): BigInt = {
    if (n == 0) 1
    else if (n == 1) 1
    else f(n - 1) + f(n - 2)
  }
  Memoize.Y(fibRec)
}
It uses the memoizing fixed-point combinator implemented as an answer to this question: In Scala 2.8, what type to use to store an in-memory mutable data table?
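Hypothetical usage, assuming the Memoize.Y helper from that linked answer is in scope (this definition is shifted by one, so fib(10) is the 11th number of the 1, 1, 2, ... sequence):
scala> fib(10)
res0: BigInt = 89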
Incidentally, the implementation of the combinator suggests a slightly more explicit technique for implementing your side-effecting lexical closure:
def fib(): () => Int = {
  var a = 0
  var b = 1
  def f(): Int = {
    val t = a
    a = b
    b = t + b
    b
  }
  f
}
Got it!! after some trial and error:
def fib(): () => Int = {
  var a = 0
  var b = 1
  return (() => {
    val t = a
    a = b
    b = t + b
    b
  })
}
Testing:
val f = fib()
println(f(), f(), f(), f(), f())
1 2 3 5 8
You don't need a temp var when using a tuple:
def fib() = {
  // seed (current, previous) so the sequence matches the Go version: 1 2 3 5 8
  var t = (1, 0)
  () => {
    t = (t._1 + t._2, t._1)
    t._1
  }
}
But in real life you should use Apocalisp's solution.