Specifying schema on JSON via Spark

I would like to specify a schema when reading JSON, but when I try to map a number to a Double it fails; I tried FloatType and IntType with no joy!
When the schema is inferred, customerid is set to String, and I would like to cast it as Double,
so df1 below comes back corrupted while df2 shows the value.
Also, FYI, I need this to be generic, as I would like to apply it to any JSON; I specified the schema below as an example of the issue I am facing.
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
val testSchema = StructType(Array(StructField("customerid", DoubleType)))
val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
df1.show(1)
df2.show(1)
Any help would be appreciated; I am sure I am missing something obvious, but for the life of me I can't tell what it is!
To clarify: I am loading a file that was saved using sparkContext.newAPIHadoopRDD,
so I am converting an RDD[JsonObject] to a DataFrame while applying the schema to it.

Since the JSON value is enclosed in double quotes, the field is treated as a String. How about casting the column to Double? This casting solution can be made generic once you know which columns are expected to be cast to Double; see the sketch after the output below.
df1.select(df1("customerid").cast(DoubleType)).show()
+----------+
|customerid|
+----------+
| 535137.0|
+----------+
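For the generic case, a minimal sketch (assuming the caller supplies the list of column names to cast; castToDouble is a hypothetical helper, not a Spark API) folds the cast over the DataFrame:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast every listed column to Double, leaving the other columns untouched
def castToDouble(df: DataFrame, doubleCols: Seq[String]): DataFrame =
  doubleCols.foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast(DoubleType)))

// e.g. castToDouble(df2, Seq("customerid")).show()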

Related

Create JSON column in Spark Scala

I have some data that needs to be written as a JSON string after some transformations in a Spark (+ Scala) job.
I'm using the to_json function along with the struct and/or array functions in order to build the final JSON that is requested.
One piece of the JSON looks like:
"field":[
"foo",
{
"inner_field":"bar"
}
]
I'm not an expert in JSON, so I don't know if this structure is usual or not; all I know is that it is valid JSON.
I'm having trouble creating a DataFrame column with this format and I want to know the best way to create this type of column.
Thanks in advance
If you have a dataframe with a bunch of columns you want to turn into a json string column, you can make use of the to_json and the struct functions. Something like this:
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._ // for toDF and the $ column syntax
val df = Seq(
(1, "string1", Seq("string2", "string3")),
(2, "string4", Seq("string5", "string6"))
).toDF("colA", "colB", "colC")
df.show
+----+-------+------------------+
|colA| colB| colC|
+----+-------+------------------+
| 1|string1|[string2, string3]|
| 2|string4|[string5, string6]|
+----+-------+------------------+
val newDf = df.withColumn("jsonString", to_json(struct($"colA", $"colB", $"colC")))
newDf.show(false)
+----+-------+------------------+--------------------------------------------------------+
|colA|colB |colC |jsonString |
+----+-------+------------------+--------------------------------------------------------+
|1 |string1|[string2, string3]|{"colA":1,"colB":"string1","colC":["string2","string3"]}|
|2 |string4|[string5, string6]|{"colA":2,"colB":"string4","colC":["string5","string6"]}|
+----+-------+------------------+--------------------------------------------------------+
struct makes a single StructType column from multiple columns and to_json turns them into a json string.
Hope this helps!
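For the nested shape in the question, struct can itself be nested inside to_json. One caveat: Spark arrays must be homogeneous, so a mixed array like ["foo", {"inner_field":"bar"}] cannot be built directly with the array function. A minimal sketch of nesting one object inside another, reusing the example df above:
import org.apache.spark.sql.functions.{struct, to_json}

// Produces e.g. {"outer":"string1","inner":{"inner_field":1}}
val nestedDf = df.withColumn(
  "jsonNested",
  to_json(struct(
    $"colB".as("outer"),
    struct($"colA".as("inner_field")).as("inner")
  ))
)
nestedDf.select("jsonNested").show(false)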

parquet format does not preserve the order of fields inside struct

I used Spark SQL to convert JSON to Parquet format, something like this:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val employee = sqlContext.read.json("employee")
scala> employee.write.parquet("employee.parquet")
My employee JSON looks like this:
{"id":1,"name":"ABC","address":{"street":"s1","zipcode":123,"state":"KA"}}
I am trying to read the Parquet file using Amazon Athena. The table structure in Athena is like this
Create external table employee
{id string,
name string,
address struct
The problem is that the order of the three fields inside the address struct is not preserved in the Parquet file, and thus I am not able to load it with the Athena schema defined above.
Any help is much appreciated.
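One possible workaround (a sketch, not from the original thread; it assumes you are free to rewrite the Parquet output) is to rebuild the struct with an explicit field order before writing, matching whatever order the Athena DDL declares:
import org.apache.spark.sql.functions.struct

// Rebuild `address` with a fixed field order (here following the JSON example above),
// then write Parquet so the struct schema matches the external table definition.
val ordered = employee.withColumn(
  "address",
  struct(
    $"address.street".as("street"),
    $"address.zipcode".as("zipcode"),
    $"address.state".as("state")
  )
)
ordered.write.parquet("employee_ordered.parquet")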

Cut off the field from JSON string

I have a JSON string like this:
{"id":"111","name":"abc","ids":["740"],"data":"abc"}
I want to cut off the field "ids"; however, I don't know a priori the values like ["740"], so it might be e.g. ["888,222"] or whatever. The goal is to get the JSON string without the field "ids".
How to do it? Should I use JackMapper?
EDIT:
I tried to use JackMapper as JacksMapper.readValue[Map[String, String]](jsonString) to get only the fields that I need, but the problem is that "ids":["740"] throws a parsing error because it's an array. So I decided to cut off this field before parsing, though it's an ugly solution; ideally I just want to parse the JSON string into a Map.
Not sure what JackMapper is, but if other libraries are allowed, my personal favourites would be:
Play-JSON:
import play.api.libs.json._

val jsonString = """{"id":"111","name":"abc","ids":["740"],"data":"abc"}"""
val json = Json.parse(jsonString).as[JsObject]
val newJson = json - "ids"
Circe:
import io.circe.parser._
val jsonString = """{"id":"111","name":"abc","ids":["740"],"data":"abc"}"""
val json = parse(jsonString).right.get.asObject.get // not handling errors
val newJson = json.remove("ids")
Note that these are minimal examples to get you going; they don't handle bad input etc.
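Since the question mentions a Jackson-based mapper: with plain Jackson (assuming jackson-databind is on the classpath), the field can also be dropped by mutating the parsed tree; a minimal sketch:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.node.ObjectNode

val mapper = new ObjectMapper()
// Parse into a mutable tree, drop the field, and serialize back to a string
val node = mapper.readTree(jsonString).asInstanceOf[ObjectNode]
node.remove("ids")
val withoutIds = mapper.writeValueAsString(node)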

How to enforce datatype for Spark DataFrame read from json

When I read JSON (without a schema) into a DataFrame, all the numeric types end up as Long. Is there a way to enforce an Integer type without giving a fully specified JSON schema?
You can convert the DataFrame into a Dataset with a case class:
val df = Seq((1,"ab"),(3,"ba")).toDF("A","B")
case class test(A: Int, B: String)
df.as[test]
Or you can duplicate the column and recast it in the DataFrame:
import org.apache.spark.sql.types.StringType
df.withColumn("newA", 'A.cast(StringType))

Parsing the JSON representation of database rows in Scala.js

I am trying out Scala.js, and would like to use simple row data extracted in JSON format from my Postgres database.
Here is a contrived example of the type of JSON I would like to parse into a strongly typed Scala collection. Features to note: there are n rows, and the columns cover various types, including an array, to reflect likely real-life scenarios. Don't worry about the SQL, which creates an inline table to extract the JSON from; I've included it for completeness. It's the parsing of the JSON in Scala.js that is causing me problems.
with inline_t as (
select * from (values('One',1,True,ARRAY[1],1.0),
('Six',6,False,ARRAY[1,2,3,6],2.4494),
('Eight',8,False,ARRAY[1,2,4,8],2.8284)) as v (as_str,as_int,is_odd,factors,sroot))
select json_agg(t) from inline_t t;
[{"as_str":"One","as_int":1,"is_odd":true,"factors":[1],"sroot":1.0},
{"as_str":"Six","as_int":6,"is_odd":false,"factors":[1,2,3,6],"sroot":2.4494},
{"as_str":"Eight","as_int":8,"is_odd":false,"factors":[1,2,4,8],"sroot":2.8284}]
I think this should be fairly easy using something like upickle or prickle, as hinted at here: How to parse a json string to a case class in scala.js and vice versa, but I haven't been able to find a code example, and I'm not up to speed enough with Scala or Scala.js to work it out myself. I'd be very grateful if someone could post some working code showing how to achieve the above.
EDIT
This is the sort of thing I've tried, but I'm not getting very far
val jsparsed = scala.scalajs.js.JSON.parse(jsonStr3)
val jsp1 = jsparsed.selectDynamic("1")
val items = jsp1.map { (item: js.Dynamic) =>
  (item.as_str, item.as_int, item.is_odd, item.factors, item.sroot)
    .asInstanceOf[js.Array[(String, Int, Boolean, Array[Int], Double)]].toSeq
}
println(items._1)
So you are in a situation where you actually want to manipulate JSON values. Since you're not serializing/deserializing Scala values from end-to-end, serialization libraries like uPickle or Prickle will not be very helpful to you.
You could have a look at a cross-platform JSON manipulation library, such as circe. That would give you the advantage that you wouldn't have to "deal with JavaScript data structures" at all. Instead, the library would parse your JSON and expose it as a Scala data structure. This is probably the best option if you want your code to also cross-compile.
If you're only writing Scala.js code, and you want a more lightweight approach (no dependency), I recommend declaring types for your JSON "schema", then using those types to perform the conversion in a safer way:
import scala.scalajs.js
import scala.scalajs.js.annotation._
// type of {"as_str":"Six","as_int":6,"is_odd":false,"factors":[1,2,3,6],"sroot":2.4494}
@ScalaJSDefined
trait Row extends js.Object {
  val as_str: String
  val as_int: Int
  val is_odd: Boolean
  val factors: js.Array[Int]
  val sroot: Double
}
type Rows = js.Array[Row]
val rows = js.JSON.parse(jsonStr3).asInstanceOf[Rows]
val items = (for (row <- rows) yield {
  import row._
  (as_str, as_int, is_odd, factors.toArray, sroot)
}).toSeq
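For the cross-compiling route via circe mentioned above, a minimal sketch (assuming circe-parser and circe-generic are available; DbRow is a name chosen here to avoid clashing with the Row trait above) decodes the same payload into plain case classes:
import io.circe.generic.auto._
import io.circe.parser.decode

// Mirror the JSON row shape as a plain Scala case class
case class DbRow(as_str: String, as_int: Int, is_odd: Boolean, factors: List[Int], sroot: Double)

// decode returns Either[io.circe.Error, List[DbRow]]
val rowsOrError = decode[List[DbRow]](jsonStr3)
rowsOrError.foreach(rows => println(rows.map(_.as_str)))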