How do I deserialise data like this into a case class like this:
case class SoundCloudUser(
  id: Int,
  permalink: String,
  username: String,
  country: String,
  full_name: String,
  city: String,
  description: String)
(that is, where the case class has fewer constructor arguments than the JSON has values)
I tried creating a FieldSerializer to do this, but I could only work out how to ignore fields when serialising, not deserialising.
As long as the fields in the JSON data are a superset of the fields in your case class, you don't need to do anything special to ignore the fields in the JSON data that aren't in your case class. It should "just work". Are you getting any errors?
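For instance, with json4s (where FieldSerializer lives; lift-json behaves the same way), a minimal sketch with an illustrative payload carrying extra fields (the field values and the two extra keys are made up for the example):

import org.json4s._
import org.json4s.native.JsonMethods._

implicit val formats: Formats = DefaultFormats

// JSON with more fields than the case class: the extras are simply ignored.
val json = parse("""
  {
    "id": 3207,
    "permalink": "jwagener",
    "username": "jwagener",
    "country": "Germany",
    "full_name": "Johannes Wagener",
    "city": "Berlin",
    "description": "JavaScript hacker",
    "discogs_name": null,
    "followers_count": 74
  }
""")

val user = json.extract[SoundCloudUser]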
I have working Siddhi code and I want to know if it's possible to output the events in JSON format without an enclosing element.
I tried setting a null enclosing.element and $., but neither of them seems to work.
@sink(type = 'file', file.uri = "/var/log/cert/output/{{name}}",
    @map(type = 'json', fail.on.missing.attribute = "false", enclosing.element = "$."))
define stream AlertStream (timestamp long, name string, ipsrc string, ipdst string, evento string, tipoAmenaza string, eventCategory string, severity string, network string, threatId string, eventTech string, eventArea string, urlOriginal string, eventID string, tag string);
I got the following result:
{"event":{"timestamp":1562232334157,"name":"client_name","ipsrc":"192.168.1.1","ipdst":"192.168.1.2","evento":"threat","tipoAmenaza":"file","eventCategory":"alert","severity":"medium","network":"192.168.0.0-192.168.255.255","threatId":"spyware","eventTech":"firewall","eventArea":"fwaas","urlOriginal":"undefined","eventID":"901e1155-5407-48ce-bddb-c7469fcf5c48","tag":"[Spyware-fwaas]"}}
and the expected output is:
{"timestamp":1562232334157,"name":"client_name","ipsrc":"192.168.1.1","ipdst":"192.168.1.2","evento":"threat","tipoAmenaza":"file","eventCategory":"alert","severity":"medium","network":"192.168.0.0-192.168.255.255","threatId":"spyware","eventTech":"firewall","eventArea":"fwaas","urlOriginal":"undefined","eventID":"901e1155-5407-48ce-bddb-c7469fcf5c48","tag":"[Spyware-fwaas]"}
You have to use custom mapping, facilitated by the @payload annotation. For more information, please refer to https://siddhi-io.github.io/siddhi-map-json/api/5.0.2/#json-sink-mapper
@sink(type = 'inMemory', topic = '{{symbol}}',
    @map(type = 'json',
        @payload("""{"StockData":{"Symbol":"{{symbol}}","Price":{{price}}}}""")))
define stream BarStream (symbol string, price float, volume long);
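Applied to the stream in the question, an untested sketch could look like the following; the cost of @payload is that every stream attribute has to be written into the template by hand:

@sink(type = 'file', file.uri = "/var/log/cert/output/{{name}}",
    @map(type = 'json', fail.on.missing.attribute = "false",
        @payload("""{"timestamp":{{timestamp}},"name":"{{name}}","ipsrc":"{{ipsrc}}","ipdst":"{{ipdst}}","evento":"{{evento}}","tipoAmenaza":"{{tipoAmenaza}}","eventCategory":"{{eventCategory}}","severity":"{{severity}}","network":"{{network}}","threatId":"{{threatId}}","eventTech":"{{eventTech}}","eventArea":"{{eventArea}}","urlOriginal":"{{urlOriginal}}","eventID":"{{eventID}}","tag":"{{tag}}"}""")))
define stream AlertStream (timestamp long, name string, ipsrc string, ipdst string, evento string, tipoAmenaza string, eventCategory string, severity string, network string, threatId string, eventTech string, eventArea string, urlOriginal string, eventID string, tag string);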
I have a YAML file that contains three types of fields, defined in the example below. Essentially, I want to be able to parse this into generic case classes that represent those data models.
This YAML file will change very often, including field names, values, etc. The only thing that won't change is the high-level format of each data type (seen below).
The biggest problem is: how can you define a case class that accepts multiple types in the same field, and parse the YAML into it?
Most of the examples online don't have much on this subject, so I tried a couple of different things that ultimately came up short. It looks like there's a problem with using sum types like Either with the circe library, as I get the error below. I also tried using a sealed trait and union types, to no avail.
Example YAML File:
name: ExampleYamlMapping
version: 0.0
mappings:
  # Single Value Field
  - name: fieldtype1
    value: "singlevalue"
  # Multivalue Fields, Unformatted
  - name: fieldtype2
    value:
      - "multivalue"
      - "multivalue1"
  # Formatted Multivalue field
  - name: fieldtype3
    content_type: "formatted multivalue"
    format: "key1 | key2"
    mappings:
      - name: key1 # Single Value Field
        value: "singlevalue"
      - name: key2 # Multivalue Field, Unformatted
        value:
          - "multivalue1"
          - "multivalue2"
Example Case Classes:
case class UnorderedField(name: String, value: Either[String, List[String]])
case class OrderedMultiValueField(content_type: String,
format: String,
mappings: List[Either[UnorderedField, OrderedMultiValueField]])
case class ContentMappingExample(
name: String,
version: String,
mappings: List[Either[UnorderedField, OrderedMultiValueField]]
)
Parsing Logic:
import cats.syntax.either._
import io.circe.generic.auto._
import io.circe.{Error, Json, ParsingFailure, yaml}

val mappingSource = scala.io.Source.fromFile(mappingFilePath)
val mappingData = try mappingSource.mkString finally mappingSource.close()

val mappings: Either[ParsingFailure, Json] = yaml.parser.parse(mappingData)

val contentMapping: ContentMappingExample = mappings
  .leftMap(err => err: Error)
  .flatMap(_.as[ContentMappingExample])
  .valueOr(throw _)
The error message is:
CNil: DownArray,DownField(mappings)
DecodingFailure(CNil, List(DownArray, DownField(mappings)))
Update on this: I figured out that you can create algebraic data types (ADTs) and define custom encoders and decoders in circe. I followed this example, which works for me: https://circe.github.io/circe/codecs/adt.html
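For reference, a minimal, untested sketch of that approach applied to this YAML; the type names and the Decoder.or chaining are mine, following the linked docs. To keep it simple, the nested mappings here are UnorderedField only, which is all the example file uses; a fully recursive Mapping needs more care:

import cats.syntax.functor._
import io.circe.Decoder
import io.circe.generic.auto._

// A field value is either a single string or a list of strings.
sealed trait FieldValue
case class SingleValue(value: String) extends FieldValue
case class MultiValue(values: List[String]) extends FieldValue

// Each mapping entry is one of the two field shapes.
sealed trait Mapping
case class UnorderedField(name: String, value: FieldValue) extends Mapping
case class OrderedMultiValueField(content_type: String,
                                  format: String,
                                  mappings: List[UnorderedField]) extends Mapping

// version is an unquoted number in the example YAML, hence Double here.
case class ContentMappingExample(name: String, version: Double, mappings: List[Mapping])

// Try each alternative in turn and widen to the base trait, as in the ADT docs.
implicit val fieldValueDecoder: Decoder[FieldValue] =
  Decoder[String].map(SingleValue).widen[FieldValue]
    .or(Decoder[List[String]].map(MultiValue).widen[FieldValue])

implicit val mappingDecoder: Decoder[Mapping] =
  Decoder[UnorderedField].widen[Mapping]
    .or(Decoder[OrderedMultiValueField].widen[Mapping])

With these decoders in scope, the original parsing logic's _.as[ContentMappingExample] call should resolve them instead of the failing Either instances.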
I am using Scala with the Play JSON library for parsing JSON. The problem we are facing is that we have the same JSON structure, but some JSON files contain a different value structure under the same key name. Let's take an example:
json-1
{
  "id": "123456",
  "name": "james",
  "company": {
    "name": "knoldus"
  }
}
json-2
{
  "id": "123456",
  "name": "james",
  "company": [
    "knoldus"
  ]
}
my case classes
case class Company(name: String)
object Company {
  implicit val companyFormat = Json.format[Company]
}
case class User(id: String, name: String, company: Company)
object User {
  implicit val userFormat = Json.format[User]
}
When the JSON contains company as a JSON object, parsing succeeds, but if company contains an array, we get a parsing exception. Our requirement: is there any way, with the Play JSON library, to ignore a field that fails to parse rather than rejecting the whole JSON file? That is, if company holds array values, ignore the company field, parse the rest, and map it to the corresponding case class.
I would add a pre-parse function that renames or removes the 'bad' company.
See the tutorial for inspiration: Traversing-a-JsValue-structure
So your parsing will work, with this little change:
case class User(id: String, name: String, company: Option[Company])
The company needs to be an Option.
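A hedged sketch of that pre-parse step (the preParse name and the jsonString input are placeholders of mine): drop company when it is not a JSON object, so the Option field decodes to None instead of failing:

import play.api.libs.json._

// Placeholder pre-parse step: remove "company" when it is not a JSON object.
def preParse(js: JsValue): JsValue = js match {
  case obj: JsObject =>
    (obj \ "company").toOption match {
      case Some(_: JsObject) => obj             // well-formed company: keep it
      case Some(_)           => obj - "company" // 'bad' company: drop the field
      case None              => obj
    }
  case other => other
}

// With company: Option[Company], json-2 now yields company = None.
val user = preParse(Json.parse(jsonString)).as[User]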
Finally, we found the answer to resolve this issue. Since we have different company structures within the JSON, we declare company as a JsValue, because whatever the company structure is, it can be assigned to the JsValue type. Our requirement is to use the object structure and, if the JSON contains an array structure instead, ignore it. So we pattern match on the company JsValue and, depending on success or failure, parse the JSON accordingly. The solution with code is given below:
case class Company(name: String)
object Company {
  implicit val companyFormat = Json.format[Company]
}
case class User(id: String, name: String, company: JsValue)
object User {
  implicit val userFormat = Json.format[User]
}
Json.parse("{ --- whatever json string --- }").validate[User].asOpt match {
  case Some(user) => user.company match {
    case obj: JsObject => obj.as[Company]
    case _             => Company("no-name")
  }
  case None => Company("no-name")
}
I am trying to programmatically enforce a schema on a textFile whose contents look like JSON. I tried jsonFile, but the issue is that, to create a DataFrame from a list of JSON files, Spark has to make a pass through the data to infer the schema. So it needs to parse all the data, which takes a long time (4 hours, since my data is zipped and TBs in size). Instead, I want to read it as a textFile and enforce a schema, extracting only the fields of interest to query against the resulting DataFrame later. But I am not sure how to map the schema to the input. Can someone give me a reference on how to map a schema onto JSON-like input?
Input:
This is the full schema :
records: org.apache.spark.sql.DataFrame = [country: string, countryFeatures: string, customerId: string, homeCountry: string, homeCountryFeatures: string, places: array<struct<freeTrial:boolean,placeId:string,placeRating:bigint>>, siteName: string, siteId: string, siteTypeId: string, Timestamp: bigint, Timezone: string, countryId: string, pageId: string, homeId: string, pageType: string, model: string, requestId: string, sessionId: string, inputs: array<struct<inputName:string,inputType:string,inputId:string,offerType:string,originalRating:bigint,processed:boolean,rating:bigint,score:double,methodId:string>>]
But I am only interested in a few fields, like:
res45: Array[String] = Array({"requestId":"bnjinmm","siteName":"bueller","pageType":"ad","model":"prepare","inputs":[{"methodId":"436136582","inputType":"US","processed":true,"rating":0,"originalRating":1},{"methodId":"23232322","inputType":"UK","processed":false,"rating":0,"originalRating":1}]
val records = sc.textFile("s3://testData/sample.json.gz")
val schema = StructType(Array(StructField("requestId",StringType,true),
StructField("siteName",StringType,true),
StructField("model",StringType,true),
StructField("pageType",StringType,true),
StructField("inputs", ArrayType(
StructType(
StructField("inputType",StringType,true),
StructField("originalRating",LongType,true),
StructField("processed",BooleanType,true),
StructField("rating",LongType,true),
StructField("methodId",StringType,true)
),true),true)))
val rowRDD = ??? // still to be worked out: how to build Rows from the text input
val inputRDD = sqlContext.applySchema(rowRDD, schema)
inputRDD.registerTempTable("input")
sql("select * from input").foreach(println)
Is there any way to map this? Or do I need to use a JSON parser or something? I want to use textFile only because of the constraints.
Tried with:
val records =sqlContext.read.schema(schema).json("s3://testData/test2.gz")
But I keep getting the error:
<console>:37: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
StructField("inputs",ArrayType(StructType(StructField("inputType",StringType,true), StructField("originalRating",LongType,true), StructField("processed",BooleanType,true), StructField("rating",LongType,true), StructField("score",DoubleType,true), StructField("methodId",StringType,true)),true),true)))
^
It can be loaded with the following code using a predefined schema; Spark doesn't need to go through the whole file in the gzip archive. The code in the question has an ambiguity: as the compiler error shows, the StructType overloads take an Array, List, or Seq of StructFields, so the nested fields must be wrapped in Array(...).
import org.apache.spark.sql.types._
val input = StructType(
Array(
StructField("inputType",StringType,true),
StructField("originalRating",LongType,true),
StructField("processed",BooleanType,true),
StructField("rating",LongType,true),
StructField("score",DoubleType,true),
StructField("methodId",StringType,true)
)
)
val schema = StructType(Array(
StructField("requestId",StringType,true),
StructField("siteName",StringType,true),
StructField("model",StringType,true),
StructField("inputs",
ArrayType(input,true),
true)
)
)
val records =sqlContext.read.schema(schema).json("s3://testData/test2.gz")
Not all the fields need to be provided, though it's good to provide all of them if possible. Spark tries its best to parse everything; if some row is not valid, it adds a _corrupt_record column containing the whole row. The same applies if it's a plain JSON file.
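If you also want to inspect the rows that failed against the supplied schema, one hedged option is to append the corrupt-record column to the schema yourself (the column name below is Spark's default, configurable via spark.sql.columnNameOfCorruptRecord):

import org.apache.spark.sql.types._

// Append Spark's corrupt-record column so malformed rows are kept, not dropped.
val schemaWithCorrupt = StructType(
  schema.fields :+ StructField("_corrupt_record", StringType, true)
)
val records = sqlContext.read.schema(schemaWithCorrupt).json("s3://testData/test2.gz")

// Malformed rows carry the whole original line in _corrupt_record.
records.filter("_corrupt_record is not null").show()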
I'm using spray-json and want to use the default values defined in case classes if a value is missing in the JSON that populates the objects.
Example
Let's say I want to create an object from the case class Person, but using a JSON document without age to do so:
case class Person(name: String, city: String, age: Int = -1)
{"name": "john", "city": "Somecity"}
How can I use the default value with spray-json?
The only way I know how to do this is to use Option[T] as the field type. And if the field is optional, this is the semantically right way to do it:
case class Person(name: String, city: String, age: Option[Int])
When age is not present, it will be None. Since, in your example, you use an absurd value (-1) as a marker that age is absent, using an Option will help you much more.
But if you really need a default value, you can either fill another case class from the one you got from the JSON using getOrElse, or call getOrElse(defaultValue) in your code when you need it.
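A minimal sketch of the Option route plus getOrElse (the protocol object name is mine):

import spray.json._

case class Person(name: String, city: String, age: Option[Int])

object PersonJsonProtocol extends DefaultJsonProtocol {
  implicit val personFormat: RootJsonFormat[Person] = jsonFormat3(Person)
}

import PersonJsonProtocol._

// A missing "age" becomes None rather than a parse error.
val person = """{"name": "john", "city": "Somecity"}""".parseJson.convertTo[Person]

// Fall back to the old sentinel only where you actually need a concrete Int.
val age = person.age.getOrElse(-1)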