Turn string into simple JSON in scala - json

I have a string in Scala that is formatted as JSON, for example:
{"name":"John", "surname":"Doe"}
But when I generate this value, it is initially a string. I need to convert this string into JSON, and I cannot change the output of the source. How can I do this conversion in Scala? (I cannot use the Play JSON library.)

If you have strings such as
{"name":"John", "surname":"Doe"}
and you want to save them to Elasticsearch as mentioned here, then you should use parseRaw instead of parseFull.
parseRaw returns an Option[JSONType], whereas parseFull returns an Option[Any] that wraps a Map (or a List).
You can do the following:
import scala.util.parsing.json._
val jsonString = "{\"name\":\"John\", \"surname\":\"Doe\"}"
val parsed = JSON.parseRaw(jsonString).get.toString()
And then use the saveJsonToEs API:
sc.makeRDD(Seq(parsed)).saveJsonToEs("spark/json-trips")
Edited
As @Aivean pointed out, when you already have a JSON string from the source, you don't need to convert it at all. If jsonString is {"name":"John", "surname":"Doe"}, you can just do
sc.makeRDD(Seq(jsonString)).saveJsonToEs("spark/json-trips")

You can use scala.util.parsing.json to convert a JSON string into Scala data structures (parseFull returns nested Maps and Lists),
e.g.
scala> import scala.util.parsing.json._
import scala.util.parsing.json._
scala> val json = JSON.parseFull("""{"name":"John", "surname":"Doe"}""")
json: Option[Any] = Some(Map(name -> John, surname -> Doe))
To navigate the json format,
scala> json match { case Some(jsonMap : Map[String, Any]) => println(jsonMap("name")) case _ => println("json is empty") }
John
nested json example,
scala> val userJsonString = """{"name":"John", "address": { "perm" : "abc", "temp" : "zyx" }}"""
userJsonString: String = {"name":"John", "address": { "perm" : "abc", "temp" : "zyx" }}
scala> val json = JSON.parseFull(userJsonString)
json: Option[Any] = Some(Map(name -> John, address -> Map(perm -> abc, temp -> zyx)))
scala> json.map(_.asInstanceOf[Map[String, Any]]("address")).map(_.asInstanceOf[Map[String, String]]("perm"))
res7: Option[String] = Some(abc)
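The chained asInstanceOf casts above throw a ClassCastException when the data does not have the expected shape. Below is a minimal sketch of a safer lookup, assuming the nested Map[String, Any] structure that parseFull returns. The helper name lookup is made up for this example and is not part of scala.util.parsing.json; the sample map is built by hand so the snippet has no dependencies.

```scala
// Hypothetical helper: walk a path of keys through nested maps,
// returning None instead of throwing when a key is missing or a
// value along the way is not a map.
def lookup(root: Any, keys: String*): Option[Any] =
  keys.foldLeft(Option(root)) {
    case (Some(m: Map[_, _]), key) => m.asInstanceOf[Map[String, Any]].get(key)
    case _                         => None
  }

// Same shape as the parseFull result shown above, built by hand here.
val user: Map[String, Any] =
  Map("name" -> "John", "address" -> Map("perm" -> "abc", "temp" -> "zyx"))

val perm    = lookup(user, "address", "perm") // Some(abc)
val missing = lookup(user, "address", "zip")  // None: key not present
val notMap  = lookup(user, "name", "first")   // None: "John" is not a map
```

With the parser, you would pass the unwrapped result instead of the hand-built map, for example json.flatMap(lookup(_, "address", "perm")).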

Related

purescript-argonaut: Decode arbitrary key-value json

Is there a way to decode arbitrary json (e.g: We don't know the keys at compile time)?
For example, I need to parse the following json:
{
"Foo": [
"Value 1",
"Value 2"
],
"Bar": [
"Bar Value 1"
],
"Baz": []
}
where the names and number of keys are not known at compile time and may change per GET request. The goal is basically to decode this into a Map String (Array String) type
Is there a way to do this using purescript-argonaut?
You can totally write your own by first parsing the string into Json via jsonParser, and then examining the resulting data structure with the various combinators provided by Argonaut.
But the quickest and simplest way, I think, is to parse it into Foreign.Object (Array String) first, and then convert it to whatever you need, like Map String (Array String):
import Data.Argonaut (decodeJson, jsonParser)
import Data.Either (Either)
import Data.Map as Map
import Foreign.Object as F
decodeAsMap :: String -> Either _ (Map.Map String (Array String))
decodeAsMap str = do
json <- jsonParser str
obj <- decodeJson json
pure $ Map.fromFoldable $ (F.toUnfoldable obj :: Array _)
The EncodeJson instance for Map generates an array of tuples; you can construct a Map manually and look at the encoded JSON.
let v = Map.fromFoldable [ Tuple "Foo" ["Value1", "Value2"] ]
traceM $ encodeJson v
Output should be [ [ 'Foo', [ 'Value1', 'Value2' ] ] ].
To do the reverse, you need to transform your object into an array of tuples; Object.entries can help you.
An example:
// Main.js
var obj = {
foo: ["a", "b"],
bar: ["c", "d"]
};
exports.tuples = Object.entries(obj);
exports.jsonString = JSON.stringify(exports.tuples);
-- Main.purs
module Main where
import Prelude
import Data.Argonaut.Core (Json)
import Data.Argonaut.Decode (decodeJson)
import Data.Argonaut.Parser (jsonParser)
import Data.Either (Either)
import Data.Map (Map)
import Debug.Trace (traceM)
import Effect (Effect)
import Effect.Console (log)
foreign import tuples :: Json
foreign import jsonString :: String
main :: Effect Unit
main = do
let
a = (decodeJson tuples) :: Either String (Map String (Array String))
b = (decodeJson =<< jsonParser jsonString) :: Either String (Map String (Array String))
traceM a
traceM b
traceM $ a == b

Spark from_json with dynamic schema

I am trying to use Spark to process JSON data with a variable, nested structure. The input JSON can be very large, with more than 1000 keys per row, and one batch can be more than 20 GB.
The entire batch is generated from 30 data sources. 'key2' of each JSON document identifies the source, and the structure for each source is predefined.
What would be the best approach for processing such data?
I have tried using from_json as below, but it works only with a fixed schema, so to use it I first need to group the data by source and then apply the matching schema.
Because of the large data volume, I would prefer to scan the data only once and extract the required values from each source, based on its predefined schema.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
val df = data.toDF
val schema = (new StructType)
.add("key1", StringType)
.add("key2", StringType)
.add("key3", (new StructType)
.add("key3_k1", StringType))
df.select(from_json($"value",schema).as("json_str"))
.select($"json_str.key3.key3_k1").collect
res17: Array[org.apache.spark.sql.Row] = Array([xxx])
This is just a restatement of @Ramesh Maharjan's answer, but with more modern Spark syntax.
I found this method lurking in DataFrameReader which allows you to parse JSON strings from a Dataset[String] into an arbitrary DataFrame and take advantage of the same schema inference Spark gives you with spark.read.json("filepath") when reading directly from a JSON file. The schema of each row can be completely different.
def json(jsonDataset: Dataset[String]): DataFrame
Example usage:
val jsonStringDs = spark.createDataset[String](
Seq(
("""{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}"""),
("""{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}""")))
jsonStringDs.show
jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
+----------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------+
|{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
|{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"} |
+----------------------------------------------------------------------------------------------------------------------+
val df = spark.read.json(jsonStringDs)
df.show(false)
df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
+----------+------------------+-------------+---------+--------+------------+------+------------+
|CEO |address |employeeCount|firstname|lastname|marketCap |name |revenue |
+----------+------------------+-------------+---------+--------+------------+------+------------+
|null |[London,Baker,121]|null |Sherlock |Holmes |null |null |null |
|Jeff Bezos|null |500000 |null |null |817117000000|Amazon|177900000000|
+----------+------------------+-------------+---------+--------+------------+------+------------+
The method is available from Spark 2.2.0:
http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader#json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame
If you have data as you mentioned in the question as
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
You don't need to create a schema for the JSON data. Spark SQL can infer the schema from the JSON strings. You just have to use sqlContext.read.json as below:
val df = sqlContext.read.json(data)
which gives you the following schema for the RDD used above:
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: struct (nullable = true)
| |-- key3_k1: string (nullable = true)
And you can just select key3_k1 as
df.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+
You can manipulate the dataframe as you wish. I hope the answer is helpful
I am not sure whether my suggestion will help, but I had a similar case and solved it as follows:
1) Use Rapture JSON (or some other JSON library) to load the JSON schema dynamically. For instance, you could read the first row of the JSON file to discover the schema (similar to what I do here with jsonSchema).
2) Generate the schema dynamically. Iterate through the dynamic fields (notice that I project the values of key3 as Map[String, String]) and add a StructField for each of them to the schema.
3) Apply the generated schema to your DataFrame.
import rapture.json._
import jsonBackends.jackson._
val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
val json = Json.parse(jsonSchema)
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.{StringType, StructType}
val schema = ArrayBuffer[StructField]()
//we could do this dynamic as well with json rapture
schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))
val items = ArrayBuffer[StructField]()
json.key3.as[Map[String, String]].foreach{
case(k, v) => {
items.append(StructField(k, StringType))
}
}
val complexColumn = new StructType(items.toArray)
schema.append(StructField("key3", complexColumn))
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")
jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()
I used the following data as input:
{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
{"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
{"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}
And the output:
+----+-------+--------+--------+--------+
|key1| key2| key3_k1| key3_k2| key3_k3|
+----+-------+--------+--------+--------+
|val1|source1|key3_v11|key3_v21|key3_v31|
|val2|source2|key3_v12|key3_v22|key3_v32|
|val3|source3|key3_v13|key3_v23|key3_v33|
+----+-------+--------+--------+--------+
A more advanced alternative, which I haven't tested yet, would be to generate a case class (e.g. JsonRow) from the JSON schema. That gives you a strongly typed Dataset, which provides better serialization performance and makes your code more maintainable. To make this work, first create a JsonRow.scala file, then implement an sbt pre-build script that rewrites the content of JsonRow.scala (you might have more than one, of course) based on your source files. To generate the source for the JsonRow class dynamically, you can use the following code:
def generateClass(members: Map[String, String], name: String) : Any = {
val classMembers = for (m <- members) yield {
s"${m._1}: String"
}
val classDef = s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
classDef
}
The generateClass method takes a map of member names and the class name, and returns the class definition as a string. You can again populate the members from your JSON schema:
import org.codehaus.jackson.node.{ObjectNode, TextNode}
import collection.JavaConversions._
val mapping = collection.mutable.Map[String, String]()
val fields = json.$root.value.asInstanceOf[ObjectNode].getFields
for (f <- fields) {
(f.getKey, f.getValue) match {
case (k: String, v: TextNode) => mapping(k) = v.asText
case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
case _ => None
}
}
val dynClass = generateClass(mapping.toMap, "JsonRow")
println(dynClass)
This prints out:
case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass
Good luck
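Returning to the original constraint (apply a predefined schema per source while scanning the input once), one option is to read key2 with get_json_object and parse each row only with the schema registered for its source. This is a sketch under assumptions, not something tested at 20 GB scale: the schemas map, the second source and its key4 field are invented for illustration, and a SparkSession named spark is assumed, as in spark-shell.

```scala
import org.apache.spark.sql.functions.{col, from_json, get_json_object, when}
import org.apache.spark.sql.types.{StringType, StructType}
import spark.implicits._

// Hypothetical registry: one predefined schema per value of key2.
val schemas: Map[String, StructType] = Map(
  "source1" -> new StructType()
    .add("key1", StringType)
    .add("key3", new StructType().add("key3_k1", StringType)),
  "source2" -> new StructType()
    .add("key1", StringType)
    .add("key4", StringType)
)

// Two sample rows from different (assumed) sources.
val raw = Seq(
  """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}""",
  """{"key1":"val2","key2":"source2","key4":"key4_v1"}"""
).toDF("value")

// Extract the discriminator field from the raw JSON string.
val withSource = raw.withColumn("source", get_json_object(col("value"), "$.key2"))

// One struct column per source; rows from other sources get null there.
val parsed = schemas.foldLeft(withSource) { case (df, (name, schema)) =>
  df.withColumn(name, when(col("source") === name, from_json(col("value"), schema)))
}

parsed.select("source", "source1.key3.key3_k1", "source2.key4").show(false)
```

All columns are derived in one pass over raw, so no grouping by source is needed up front. The trade-off is a wide result with one mostly-null struct column per source, which you would then filter or split by the source column.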

Spark from_json - StructType and ArrayType

I have a data set that comes in as XML, and one of the nodes contains JSON. Spark is reading this in as a StringType, so I am trying to use from_json() to convert the JSON to a DataFrame.
I am able to convert a string of JSON, but how do I write the schema to work with an Array?
String without Array - Working nicely
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schemaExample = new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
val dfExample = spark.sql("""select "{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }" as theJson""")
val dfICanWorkWith = dfExample.select(from_json($"theJson", schemaExample))
dfICanWorkWith.collect()
// Results \\
res19: Array[org.apache.spark.sql.Row] = Array([[Johnny,Boy]])
String with an Array - Can't figure this one out
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schemaExample2 = new StructType()
.add("", ArrayType(new StructType()
.add("FirstName", StringType)
.add("Surname", StringType)
)
)
val dfExample2= spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")
val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))
dfICanWorkWith.collect()
// Result \\
res22: Array[org.apache.spark.sql.Row] = Array([null])
The problem is that your string is not well-formed JSON for that schema. It is missing a few things:
First, the surrounding {} that make it a JSON object.
Second, the field name (you declared it as "" in the schema but did not include it in the JSON).
Lastly, the closing ].
Try replacing it with:
val dfExample2= spark.sql("""select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")
and you will get:
scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])
As of Spark 2.4, the schema_of_json function helps:
> SELECT schema_of_json('[{"col":0}]');
array<struct<col:int>>
In your case, you can then use the code below to parse that array of JSON objects:
scala> spark.sql("""select from_json("[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]", 'array<struct<FirstName:string,Surname:string>>' ) as theJson""").show(false)
+------------------------------+
|theJson |
+------------------------------+
|[[Johnny, Boy], [Franky, Man]]|
+------------------------------+
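The same array schema can also be passed through the DataFrame API rather than SQL. A sketch, assuming Spark 2.2 or later (where from_json accepts a DataType such as a top-level ArrayType), a SparkSession named spark as in spark-shell, and input JSON that includes its closing bracket. The names personSchema and people are illustrative.

```scala
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import spark.implicits._

// Schema of a single array element.
val personSchema = new StructType()
  .add("FirstName", StringType)
  .add("Surname", StringType)

// One row holding a JSON array of two objects (note the closing bracket).
val dfExample2 = Seq(
  """[{"FirstName":"Johnny","Surname":"Boy"},{"FirstName":"Franky","Surname":"Man"}]"""
).toDF("theJson")

// Parse the string as array<struct<...>>, then produce one row per element.
val people = dfExample2
  .select(explode(from_json(col("theJson"), ArrayType(personSchema))).as("person"))
  .select("person.FirstName", "person.Surname")

people.show(false)
```

Wrapping the element schema in ArrayType avoids having to invent an enclosing object with an empty field name.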

json4s JValue expected (String,String) given

Using Scala and json4s (maybe I'm missing a golden fish library or something).
I am trying to add a list (or array) of strings to a JSON object so that in the end it looks like:
{"already":"here",..."listToAdd":["a","b","c"]}
The thing is that I already have the existing JSON in a JObject and the list of strings in an Array[String] (but it could be changed to a List if needed). So I followed the doc at json4s.org, which states that:
Any seq produces JSON array.
scala> val json = List(1, 2, 3)
scala> compact(render(json))
res0: String = [1,2,3]
Tuple2[String, A] produces field.
scala> val json = ("name" -> "joe")
scala> compact(render(json))
res1: String = {"name":"joe"}
And when trying it, it gives:
Error:(15, 28) type mismatch;
found : (String, String)
required: org.json4s.JValue
(which expands to) org.json4s.JsonAST.JValue
println(compact(render(idJSON)))
Using Scala 2.11.4
Json4s 3.2.11 (Jackson)
You have to additionally import some implicit conversion methods:
import org.json4s.JsonDSL._
These will convert the Scala objects into the library's Json AST.
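For the original goal of appending a list of strings to an existing JObject, here is a minimal sketch, assuming json4s-jackson is on the classpath. The value names (existing, listToAdd, combined) are illustrative, and the array is built explicitly with JArray and JString so the example does not depend on which implicit conversions are in scope beyond the ~ operator from JsonDSL.

```scala
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Stand-in for the JObject you already have.
val existing: JObject = JObject(List("already" -> JString("here")))

// The strings to add; an Array works once converted to a List.
val listToAdd: List[String] = Array("a", "b", "c").toList

// Build the JSON array explicitly, then append it as a new field with ~.
val combined: JObject = existing ~ ("listToAdd" -> JArray(listToAdd.map(JString(_))))

println(compact(render(combined)))
// prints {"already":"here","listToAdd":["a","b","c"]}
```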

Serializing a JSON string as JSON in Scala/Play

I have a String with some arbitrary JSON in it. I want to construct a JsObject with my JSON string as a JSON object value, not a string value. For example, assuming my arbitrary string is a boring {} I want {"key": {}} and not {"key": "{}"}.
Here's how I'm trying to do it.
val myString = "{}"
Json.obj(
"key" -> Json.parse(myString)
)
The error I get is
type mismatch; found :
scala.collection.mutable.Buffer[scala.collection.immutable.Map[String,java.io.Serializable]]
required: play.api.libs.json.Json.JsValueWrapper
I'm not sure what to do about that.
"{}" is an empty object.
So, to get {"key": {}} :
Json.obj("key" -> Json.obj())
Update:
Perhaps you have an old version of Play. This works under Play 2.3.x:
scala> import play.api.libs.json._
scala> Json.obj("foo" -> Json.parse("{}"))
res2: play.api.libs.json.JsObject = {"foo":{}}