How to load CSV with nested columns using Apache Spark

I have a csv file:
name,age,phonenumbers
Tom,20,"[{number:100200, area_code:555},{number:100300, area_code:444}]"
Harry,20,"[{number:100400, area_code:555},{number:100500, area_code:666}]"
How can I load this file in Spark into an RDD/Dataset of Person, where a Person object looks like this:
class Person {
String name;
Integer age;
List<Phone> phonenumbers;
class Phone {
int number;
int area_code;
}
}

Unfortunately, the field names of the nested objects are not quoted in your example. Is that really the case? If they are quoted (i.e. the column holds well-formed JSON), you can simply use the from_json function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Schema of the phonenumbers column: an array of {number, area_code} structs
val schema = ArrayType(
  new StructType()
    .add("number", IntegerType)
    .add("area_code", IntegerType),
  containsNull = false)
// input is the DataFrame read from the CSV file
val converted = input.withColumn("phones", from_json(col("phonenumbers"), schema))
If that's not the case, you'll need your own logic to convert the string into the nested object, for example:
import org.apache.spark.sql.Row
import spark.implicits._ // encoders for the case classes; spark is the active SparkSession
case class Phone(number: Int, area_code: Int)
case class Person(name: String, age: Int, phonenumbers: Array[Phone])
val converted = input.map {
  // age only matches as Int if the CSV was read with a schema or with inferSchema enabled
  case Row(name: String, age: Int, phonenumbers: String) => {
    // Matches one unquoted entry such as {number:100200, area_code:555}
    val phoneFormat = raw"\{number:(\d{6}), area_code:(\d{3})\}".r
    val phones = for (m <- phoneFormat.findAllMatchIn(phonenumbers))
      yield Phone(m.group(1).toInt, m.group(2).toInt)
    Person(name, age, phones.toArray)
  }
}
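The extraction step is plain regex work, so it can be checked on its own without a Spark session. Below is a minimal sketch in Kotlin rather than Scala. The Phone class and the parsePhones helper are hypothetical names, and the pattern is loosened to \d+ on the assumption that numbers and area codes are not always exactly six and three digits long.

```kotlin
// Hypothetical holder for one entry of the phonenumbers column.
data class Phone(val number: Int, val areaCode: Int)

// Matches one unquoted entry such as {number:100200, area_code:555}.
val phoneFormat = Regex("""\{number:(\d+),\s*area_code:(\d+)\}""")

// Extracts every entry found in the raw column value; text that does not match is ignored.
fun parsePhones(raw: String): List<Phone> =
    phoneFormat.findAll(raw)
        .map { Phone(it.groupValues[1].toInt(), it.groupValues[2].toInt()) }
        .toList()

fun main() {
    val cell = "[{number:100200, area_code:555},{number:100300, area_code:444}]"
    println(parsePhones(cell))
    // [Phone(number=100200, areaCode=555), Phone(number=100300, areaCode=444)]
}
```

Ignoring non-matching text keeps the helper total, but it also hides malformed rows, so a real job may prefer to count or log cells where nothing matched.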

Related

Deserialize JSON array with different values type with kotlinx.serialization library

I'm trying to deserialize the following string:
val stringJson = "{\"decomposed\":[\", \",{\"id\":4944372,\"name\":\"Johny\",\"various\":false,\"composer\":false,\"genres\":[]}]}"
Deserialization works fine with the following code:
@Serializable
data class Artist(
    val decomposed: JsonArray
)
fun main() {
    val jsonString = "{\"decomposed\":[\", \",{\"id\":4944372,\"name\":\"Johny\",\"various\":false,\"composer\":false,\"genres\":[]}]}"
    println(Json.decodeFromString<Artist>(jsonString))
}
But I want to do something like this:
@Serializable
class Decomposed {
    @Serializable
    class DecomposedClassValue(val value: DecomposedClass)
    @Serializable
    class StringValue(val value: String)
}
@Serializable
data class DecomposedClass(
    val id: Long? = null,
    val name: String? = null,
    val various: Boolean? = null,
    val composer: Boolean? = null,
    val genres: JsonArray? = null
)
@Serializable
data class Artist(
    val decomposed: List<Decomposed>
)
fun main() {
    val jsonString = "{\"decomposed\":[\", \",{\"id\":4944372,\"name\":\"Johny\",\"various\":false,\"composer\":false,\"genres\":[]}]}"
    println(Json.decodeFromString<Artist>(jsonString))
}
But kotlinx.serialization, as expected, fails with JsonDecodingException: Unexpected JSON token at offset 15: Expected '{, kind: CLASS'
I can't figure out how to rewrite my Decomposed class so that deserialization works. Can you please help me out?
What you are trying to do is called polymorphic deserialization.
It requires the target classes of deserialization to have a common superclass (preferably sealed):
@Serializable
data class Artist(
    val decomposed: List<Decomposed>
)
@Serializable
sealed class Decomposed
// String can't be given a superclass, so wrap it in a class that extends Decomposed
@Serializable
class StringValue(val value: String) : Decomposed()
// DecomposedClassValue is redundant: DecomposedClass can extend Decomposed directly
@Serializable
data class DecomposedClass(
    val id: Long? = null,
    val name: String? = null,
    val various: Boolean? = null,
    val composer: Boolean? = null,
    val genres: JsonArray? = null
) : Decomposed()
This will allow you to deserialize JSON of the following format (the type value must match each class's serial name, which defaults to the fully qualified class name):
val jsonString = "{\"decomposed\":[{\"type\":\"StringValue\", \"value\":\",\"}, {\"type\":\"DecomposedClass\", \"id\":4944372,\"name\":\"Johny\",\"various\":false,\"composer\":false,\"genres\":[]}]}"
Since there is no class discriminator in the original JSON, the serialization library can't determine which serializer to use for each element. You will have to write a custom JsonContentPolymorphicSerializer and wire it to the Decomposed class. You also have to write a custom serializer for the StringValue class, because it is represented in JSON as a plain string, not as a JSON object with a value field of type String:
import kotlinx.serialization.*
import kotlinx.serialization.descriptors.*
import kotlinx.serialization.encoding.*
import kotlinx.serialization.json.*
// Picks a serializer from the shape of each array element
object DecomposedSerializer : JsonContentPolymorphicSerializer<Decomposed>(Decomposed::class) {
    override fun selectDeserializer(element: JsonElement) = when {
        element is JsonPrimitive -> StringValue.serializer()
        else -> DecomposedClass.serializer()
    }
}
// Reads and writes StringValue as a bare JSON string
object StringValueSerializer : KSerializer<StringValue> {
    override val descriptor: SerialDescriptor = buildClassSerialDescriptor("StringValue")
    override fun deserialize(decoder: Decoder): StringValue {
        require(decoder is JsonDecoder)
        val element = decoder.decodeJsonElement()
        return StringValue(element.jsonPrimitive.content)
    }
    override fun serialize(encoder: Encoder, value: StringValue) {
        encoder.encodeString(value.value)
    }
}
@Serializable(with = DecomposedSerializer::class)
sealed class Decomposed
@Serializable(with = StringValueSerializer::class)
class StringValue(val value: String) : Decomposed()
This will allow you to deserialize JSON in the original format.

How to convert list of int in json to list/array of enums using Moshi?

I'm getting a list of ints (which are really enums) from the API. When I try to parse it, I get: Unable to create converter for java.util.List<MyEnum>
My adapter currently looks like this:
@Retention(AnnotationRetention.RUNTIME)
@JsonQualifier
annotation class MyEnumListAnnotation
class MyEnumListAdapter {
    @ToJson
    fun toJson(@MyEnumListAnnotation myEnumList: List<MyEnum>): List<Int> {
        return myEnumList.map { it.type }
    }
    @FromJson
    @MyEnumListAnnotation
    fun fromJson(typeList: List<Int>): List<MyEnum> {
        return typeList.map { MyEnum.from(it) }
    }
}
I'm adding this to the network client like this:
Moshi.Builder()
    .add([A lot of other adapters])
    .add(MyEnumListAdapter())
And I'm using the annotation like this (in the object I want to parse to):
data class InfoObject(
    val id: String,
    val name: String,
    val email: String,
    val phone: String,
    @MyEnumListAnnotation
    val myEnums: List<MyEnum>
)
How can I write my adapter so that this works? Thanks for all help. :)
If you use Moshi's codegen (which you should), you only need to write an adapter for MyEnum itself:
class MyEnumAdapter {
    @ToJson
    fun toJson(enum: MyEnum): Int {
        return enum.type
    }
    @FromJson
    fun fromJson(type: Int): MyEnum {
        return MyEnum.from(type)
    }
}
Attach the adapter to your Moshi builder the way you did in your question. Then update your InfoObject:
@JsonClass(generateAdapter = true)
data class InfoObject(
    @Json(name = "id") val id: String,
    @Json(name = "name") val name: String,
    @Json(name = "email") val email: String,
    @Json(name = "phone") val phone: String,
    @Json(name = "myEnums") val myEnums: List<MyEnum>
)
@JsonClass(generateAdapter = true) makes the library generate an adapter for InfoObject. Moshi's built-in List support then handles List<MyEnum> through your MyEnum adapter (this is the adapter you tried to write yourself), so you don't have to create either one. @Json(name = "...") is only needed when the JSON name differs from the property name, so you can omit it here.
To integrate codegen, add this to your dependencies:
kapt("com.squareup.moshi:moshi-kotlin-codegen:1.9.3")
See https://github.com/square/moshi for more details.
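The adapter above relies on MyEnum exposing a type code and a from lookup, neither of which is shown in the question. Here is a minimal sketch of what such an enum could look like. The constant names, their Int codes, and the UNKNOWN fallback are assumptions for illustration only.

```kotlin
// Hypothetical enum: the constants and their Int codes are illustrative assumptions.
enum class MyEnum(val type: Int) {
    UNKNOWN(0),
    FIRST(1),
    SECOND(2);

    companion object {
        // Maps an API code to its constant, falling back to UNKNOWN for unrecognised codes.
        fun from(type: Int): MyEnum = values().firstOrNull { it.type == type } ?: UNKNOWN
    }
}

fun main() {
    val decoded = listOf(1, 2, 7).map { MyEnum.from(it) }
    println(decoded)                 // [FIRST, SECOND, UNKNOWN]
    println(decoded.map { it.type }) // [1, 2, 0]
}
```

Falling back to a sentinel constant keeps parsing from failing when the API adds a new code. Throwing from the lookup instead would surface such changes earlier.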

convert json string to case class object from given json string and type of case class

The requirement is to convert a JSON string to a case class object in Scala, given the JSON string and the type of the case class.
I have tried the Gson and Jackson libraries, but was not able to solve the given requirement.
package eg.json
import com.fasterxml.jackson.databind.ObjectMapper
import com.google.gson.Gson
import com.typesafe.scalalogging.LazyLogging
case class Person(name : String, age : Int)
case class Address(street : String, buildingNumber : Int, zipCode : Int)
case class Rent(amount : Double, month : String)
//there are many other case classes
object JsonToObject extends LazyLogging{
  import logger._
  def toJsonString(ref : Any) : String = {
    val gson = new Gson()
    val jsonString = gson.toJson(ref)
    jsonString
  }
  def main(args: Array[String]): Unit = {
    val person = Person("John", 35)
    val jsonString = toJsonString(person)
    //here requirement is to convert json string to case class instance, provided the type of case class instance
    val gsonObj = toInstanceUsingGson( jsonString, Person.getClass )
    debug(s"main : object deserialized using gson : $gsonObj")
    val jacksonObj = toInstanceUsingJackson( jsonString, Person.getClass )
    debug(s"main : object deserialized using gson : $jacksonObj")
  }
  def toInstanceUsingGson[T](jsonString : String, caseClassType : Class[T]) : T = {
    val gson = new Gson()
    val ref = gson.fromJson(jsonString, caseClassType)
    ref
  }
  def toInstanceUsingJackson[T](jsonString : String, caseClassType : Class[T]) : T = {
    val mapper = new ObjectMapper()
    val ref = mapper.readValue(jsonString, caseClassType)
    ref
  }
}
The output of the above code is:
01:32:52.369 [main] DEBUG eg.json.JsonToObject$ - main : object deserialized using gson : Person
Exception in thread "main" com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "name" (class eg.json.Person$), not marked as ignorable (0 known properties: ])
at [Source: (String)"{"name":"John","age":35}"; line: 1, column: 10] (through reference chain: eg.json.Person$["name"])
at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:60)
at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownProperty(DeserializationContext.java:822)
at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1152)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1589)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownVanilla(BeanDeserializerBase.java:1567)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:294)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4013)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3004)
at eg.json.JsonToObject$.toInstanceUsingJackson(JsonToObject.scala:49)
at eg.json.JsonToObject$.main(JsonToObject.scala:34)
at eg.json.JsonToObject.main(JsonToObject.scala)
Kindly suggest how to achieve this using Gson or Jackson, or suggest some other library with a sample example.
The simplified problem above is on GitHub:
https://github.com/moglideveloper/JsonToScalaObject
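With Jackson you can do it like this. Note that Person.getClass in the question returns the class of the companion object (Person$, as the exception message shows), not of the case class; classOf[Person], or a Manifest-based helper like the one below, gives the right type:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
val mapper = new ObjectMapper() with ScalaObjectMapper
// this line may be needed, depending on your case classes (it adds Scala case class support)
mapper.registerModule(DefaultScalaModule)
def fromJson[T](json: String)(implicit m: Manifest[T]): T = {
  mapper.readValue[T](json)
}
I think it is really clean with the Jackson library.
Usage looks like this:
val json: String = ???
val personObject: Person = fromJson[Person](json)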
Try using circe, which is built on Cats.
Add circe to your project (https://circe.github.io/circe/ - Quick Start).
Create a case class that represents what you want to build from your JSON.
Declare a decoder:
https://circe.github.io/circe/codecs/semiauto-derivation.html
https://github.com/circe/circe
import io.circe.{Decoder, Json}
import io.circe.generic.semiauto.deriveDecoder
case class DataToDecode(name : String,
                        age : Int,
                        street : String,
                        buildingNumber : Int,
                        zipCode : Int,
                        amount : Double,
                        month : String)
object DataToDecode {
  implicit val dataToDecode: Decoder[DataToDecode] = deriveDecoder
  // Throws the DecodingFailure if the JSON does not match the case class
  def decodeData(data: Json): DataToDecode = data.as[DataToDecode] match {
    case Right(value) => value
    case Left(error) => throw error
  }
}

Parse a nested JSON with Kotlinx.Serialization

I've been playing with Kotlinx.serialization, and I have been trying to parse only part of a JSON document.
Given a JSON like:
{
  "Parent" : {
    "SpaceShip":"Tardis",
    "Mark":40
  }
}
And my code is something like:
data class SomeClass(
    @SerialName("SpaceShip") val ship: String,
    @SerialName("Mark") val mark: Int)
Obviously, Json.nonstrict.parse(SomeClass.serializer(), rawString) will fail because the pair "SpaceShip" and "Mark" is not at the root of the JSON.
How do I make the serializer refer to a subtree of the JSON?
P.S.: Would you recommend Retrofit instead (because it's older, and maybe more mature)?
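import java.io.InputStream
import java.util.Scanner
import kotlinx.serialization.*
import kotlinx.serialization.json.*
// Wrapper that mirrors the root object, so the "Parent" key maps to SomeClass
@Serializable
data class Parent(
    @SerialName("Parent")
    val someClass: SomeClass
)
@Serializable
data class SomeClass(
    @SerialName("SpaceShip")
    val ship: String,
    @SerialName("Mark")
    val mark: Int
)
fun getSomeClass(inputStream: InputStream): SomeClass {
    val json = Json(JsonConfiguration.Stable)
    // Read the whole stream into one string
    val jsonString = Scanner(inputStream).useDelimiter("\\A").next()
    val parent = json.parse(Parent.serializer(), jsonString)
    return parent.someClass
}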
import kotlinx.serialization.*
import kotlinx.serialization.json.Json
@Serializable
data class Parent(
    @SerialName("Parent")
    val parent: SomeClass
)
@Serializable
data class SomeClass(
    @SerialName("SpaceShip")
    val ship: String,
    @SerialName("Mark")
    val mark: Int
)
fun main() {
    val parent = Json.parse(Parent.serializer(), "{\"Parent\":{\"SpaceShip\":\"Tardis\",\"Mark\":40}}")
    println(parent)
}
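Both answers use the pre-1.0 API (Json.parse, JsonConfiguration). As a hedged sketch for kotlinx.serialization 1.x, assuming the serialization compiler plugin is applied, the subtree can also be selected directly, without a wrapper class. parseSomeClass is a hypothetical helper name.

```kotlin
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.decodeFromJsonElement
import kotlinx.serialization.json.jsonObject

@Serializable
data class SomeClass(
    @SerialName("SpaceShip") val ship: String,
    @SerialName("Mark") val mark: Int
)

// Hypothetical helper: parse the document into a JSON tree,
// pick the "Parent" subtree, and decode only that part.
fun parseSomeClass(raw: String): SomeClass {
    val root = Json.parseToJsonElement(raw).jsonObject
    val subtree = root["Parent"] ?: error("Missing \"Parent\" key")
    return Json.decodeFromJsonElement<SomeClass>(subtree)
}

fun main() {
    val raw = """{"Parent":{"SpaceShip":"Tardis","Mark":40}}"""
    println(parseSomeClass(raw)) // SomeClass(ship=Tardis, mark=40)
}
```

The wrapper-class approach is usually simpler when the surrounding structure is fixed. Going through the JSON tree is mainly useful when the key to descend into is only known at runtime.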

JSON reader Kotlin

How can I read a JSON file into more than one document and save them in MongoDB?
I have two models:
@Document
data class Person(val name: String) {
    @Id
    private val id: String? = null
}
And:
@Document
data class Floor(
    private var floorName: StoreyEnum,
    private val roomNumber: String,
    private val personID: String
) {
    @Id
    private val id: String? = null
}
I have a JSON file that contains fields for both models. I also want to connect these documents with a "relation". How can I do that?
Use Gson if it's on a JVM backend.
BTW, I don't quite see the purpose of making id private, a val, and initialized to null at the same time: that way it is always null, never changed and never read. So I changed it to this:
data class Person(val name: String, private val id: String? = null)
Then you can use Gson to encode and parse the object:
import com.google.gson.Gson
fun main(args: Array<String>) {
    val gson = Gson()
    val person = Person("name", "0")
    println(person)
    // Encode to JSON, then parse the JSON back into a Person
    val personJson = gson.toJson(person)
    println(personJson)
    val parsedPerson = gson.fromJson(personJson, Person::class.java)
    println(parsedPerson)
}
Output:
Person(name=name, id=0)
{"name":"name","id":"0"}
Person(name=name, id=0)