Custom Formatting of JSON output using Spark - json

I have a dataset with a bunch of BigDecimal values. I would like to output these records to a JSON file, but when I do, the BigDecimal values are often written with trailing zeros (123.4000000000000), and the spec we must conform to does not allow this (for reasons I don't understand).
I am trying to see if there is a way to override how the data is printed to JSON.
Currently, my best idea is to convert each record to a string using Jackson and then write the data using df.write().text(..) rather than JSON.

I suggest converting the Decimal column to String before writing to JSON.
The code below is in Scala, but it translates to Java easily.
import org.apache.spark.sql.types.StringType
// COLUMN_NAME is your DataFrame column name.
val newDf = df.withColumn("COLUMN_NAME_TMP", df("COLUMN_NAME").cast(StringType)).drop("COLUMN_NAME").withColumnRenamed("COLUMN_NAME_TMP", "COLUMN_NAME")
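If the string form still carries the column's full scale, the trailing zeros have to be trimmed as a separate step. The following is a minimal sketch of that trimming in plain Python rather than Spark; the helper name strip_trailing_zeros is made up for illustration, and the same idea would need to be ported to whatever runs per value in your job.

```python
from decimal import Decimal


def strip_trailing_zeros(value: Decimal) -> str:
    """Render a Decimal in plain notation without trailing zeros."""
    # normalize() drops trailing zeros but may switch to exponent form
    # (e.g. 100 becomes 1E+2); the "f" format forces plain notation again.
    # Note: normalize() rounds to the current context precision
    # (28 significant digits by default).
    return format(value.normalize(), "f")


samples = [Decimal("123.4000000000000"), Decimal("100"), Decimal("0.500")]
rendered = [strip_trailing_zeros(d) for d in samples]
print(rendered)
```

Note that the result is a string, so a JSON writer would emit it in quotes unless the record is assembled as text.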

Related

How to prevent adding backslash to JSON string

I would like to read events from Event Hub using Databricks. The events are in JSON format, but they can have different schemas (this matters because the solutions I found pass a schema to the from_json(jsonStr, schema) function, which I cannot do in my use case). When I use
.withColumn('Value', col('value').cast(StringType())) the dataframe returns JSON output with backslashes: "{\"time\": 1432826855000,\"host\":...... .
I found a solution in "How to prevent spark sql with kafka from adding backslash to JSON string in dataframe", but in the Delta Live Tables framework we create streaming tables by returning a dataframe, so I can't use that solution.
Should I use non-PySpark functions in the ETL process, such as those in
"How to remove backslash from decoded JSON string?"?
Would that be efficient when streaming from Event Hub to bronze?
You shouldn't worry about those backslashes. They are just the visual representation of your string when the data is displayed and the string contains embedded " characters. Internally, the data is stored without backslashes, like: {"time": 1432826855000,"host":.......
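The difference between the stored value and its displayed form can be seen in plain Python. This is only an illustration of the escaping effect, not Databricks-specific code, and the "host" value below is a made-up placeholder.

```python
import json

# The actual string value: it contains double quotes but no backslashes.
raw = '{"time": 1432826855000, "host": "example"}'

# What a display layer typically shows: the whole string quoted,
# with every embedded double quote escaped by a backslash.
displayed = json.dumps(raw)

print(raw)
print(displayed)

# Parsing the actual value works without any backslash removal.
event = json.loads(raw)
```

The backslashes exist only in the quoted rendering, so there is nothing to strip from the underlying column.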

Convert doctrine array to JSON

Is there a way to read a column of Doctrine type "simple_array" or "array" as JSON?
My Doctrine database is accessed through another API, and I want to read data from that API. However, there is a column of Doctrine type array that I want to convert into JSON.
I am unsure whether there is a preferred way of doing this or whether I need to hack my way around it.
Here is an example of what is stored in the database as a doctrine array:
"a:1:{i:0;a:3:{s:3:\u0022day\u0022;i:5;s:4:\u0022time\u0022;s:7:\u0022morning\u0022;s:12:\u0022availability\u0022;N;}}"
That looks like the format of PHP's serialize() function, with the literal double quotes in the string converted to Unicode escape sequences.
You could do the following:
1. Fetch the serialized string.
2. Fix the \u0022 sequences (replace them with ").
3. unserialize() it to reproduce the array.
4. Convert the array to JSON with json_encode().
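If the consuming side is not PHP, the same steps can be approximated without unserialize(). Below is a rough Python sketch that parses only a small subset of the serialize() format (N, b, i, d, s, a); the function names are made up for illustration, and objects, references, and malformed input are not handled.

```python
import json


def php_unserialize(data: bytes):
    """Parse a subset of PHP's serialize() output: N, b, i, d, s, a."""
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after serialized value")
    return value


def _parse(data: bytes, pos: int):
    tag = data[pos:pos + 1]
    if tag == b"N":                       # N;
        return None, pos + 2
    if tag in (b"b", b"i", b"d"):         # b:1;  i:5;  d:0.5;
        end = data.index(b";", pos)
        raw = data[pos + 2:end].decode("ascii")
        if tag == b"b":
            return raw == "1", end + 1
        return (int(raw) if tag == b"i" else float(raw)), end + 1
    if tag == b"s":                       # s:<byte length>:"<bytes>";
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2                 # skip the colon and opening quote
        text = data[start:start + length].decode("utf-8")
        return text, start + length + 2   # skip the closing quote and ;
    if tag == b"a":                       # a:<count>:{<key><value>...}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2                   # skip the colon and opening brace
        items = {}
        for _ in range(count):
            key, pos = _parse(data, pos)
            items[key], pos = _parse(data, pos)
        return items, pos + 1             # skip the closing brace
    raise ValueError(f"unsupported type tag {tag!r} at offset {pos}")


def to_jsonable(value):
    """Turn PHP arrays keyed 0..n-1 into lists, others into objects."""
    if isinstance(value, dict):
        converted = {k: to_jsonable(v) for k, v in value.items()}
        if list(converted) == list(range(len(converted))):
            return list(converted.values())
        return converted
    return value


# Step 1: the stored string from the question.
stored = (r"a:1:{i:0;a:3:{s:3:\u0022day\u0022;i:5;s:4:\u0022time\u0022;"
          r"s:7:\u0022morning\u0022;s:12:\u0022availability\u0022;N;}}")
# Step 2: restore the literal double quotes.
fixed = stored.replace(r"\u0022", '"')
# Steps 3 and 4: parse, then encode as JSON.
result = json.dumps(to_jsonable(php_unserialize(fixed.encode("utf-8"))))
print(result)
```

In PHP itself, unserialize() followed by json_encode() remains the simpler route, since it covers the whole format.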

converting a string to json format in scala

I have to convert a string to json format in scala. The string is like this:
"classification" : "Map(Metals -> List(Cu, Co, Ni), Nonmetals -> List(N,O,C), Noblegases -> List(Ar, Kr))"
The desired json format is like this:
"classification" : {"Metals": [Cu, Co, Ni],
"Nonmetals":[N,O,C],
"Noblegases":[Ar, Kr]
}
Any quick suggestions would be appreciated.
Your question is not very specific, so my answer is a bit vague as well.
First you will have to parse the input string and extract the values. I would use a combination of regular expressions and simple String operations, such as searching for the first occurrence of a certain character (e.g. a colon) and splitting the string there.
In the next step you create the JSON object. There are several libraries you can use. I suggest JSON-Java/org.json, or, if you prefer a Scala library, play-json.
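The two steps above can be sketched as follows. The sketch is in Python for brevity rather than Scala, the function name is made up, and it assumes every entry has the exact shape "Key -> List(...)" with no nesting; the same regular expression approach should carry over to Scala's regex support.

```python
import json
import re


def scala_map_to_dict(text: str) -> dict:
    """Extract 'Key -> List(a, b, c)' entries from a Map(...) string."""
    # Each match yields the key and the raw comma-separated list body.
    entries = re.findall(r"(\w+)\s*->\s*List\(([^)]*)\)", text)
    return {
        key: [item.strip() for item in items.split(",") if item.strip()]
        for key, items in entries
    }


text = ("Map(Metals -> List(Cu, Co, Ni), "
        "Nonmetals -> List(N,O,C), Noblegases -> List(Ar, Kr))")
classification = scala_map_to_dict(text)
print(json.dumps({"classification": classification}))
```

The list items come out as quoted JSON strings, which is what valid JSON requires for values such as Cu or Ar.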

Library to convert JSON string to Erlang record

I have a large JSON string that I want to convert into an Erlang record.
I found the jiffy library, but it doesn't completely convert to a record.
For example:
jiffy:decode(<<"{\"foo\":\"bar\"}">>).
gives
{[{<<"foo">>,<<"bar">>}]}
but I want the following output:
{ok,{obj,[{"foo",<<"bar">>}]},[]}
Is there any library that can produce the desired output?
Or is there a library that can be used in combination with jiffy to further modify its output?
Keep in mind that the JSON string is large, and I want the output in minimum time.
Take a look at ejson, from the documentation:
JSON library for Erlang on top of jsx. It gives a declarative interface for jsx by which we need to specify conversion rules and ejson will convert tuples according to the rules.
I made this library to make easy not just the encoding but rather the decoding of JSONs to Erlang records...
In order for ejson to take effect the source files need to be compiled with parse_transform ejson_trans. All record which has -json attribute can be converted to JSON later.

How to convert between BSON and JSON, especially for those special objects?

I am not asking for a library to do this; I am writing the code for bson_to_json and json_to_bson myself.
Here is the BSON specification.
For regular double, doc, array, and string values, it is easy to convert between BSON and JSON.
However, for special objects such as these:
Timestamp and UTC:
When converting from JSON to BSON, how can I know they are timestamp and UTC?
Regex (string, string), JavaScript code with scope (string, doc):
Their structures have multiple parts; how can I represent those structures in JSON?
Binary data (generic, function, etc.):
How can I represent the type of binary data in JSON?
int32 and int64:
How can I represent them in JSON so that BSON knows which is 32-bit and which is 64-bit?
Thanks
As we know, JSON cannot express objects, so you will need to decide how you want the stringified version of the BSON objects (field types) to be represented in the output of your OCaml driver.
Some of the data types are easy: Timestamp is not needed since it is internal to sharding only, and JavaScript blocks are best left out because they are best used only within system.js as saved functions for use in MRs.
You also have to consider that some of these fields go both in and out. By that I mean that some are used to specify input documents to be serialised to BSON, and some are part of output documents that need deserialising from BSON into JSON.
Regex is one field type you will most likely send down. As such, you will need to serialise your OCaml object to the BSON equivalent of {$regex: 'd', '$options': 'ig'} from the /d/ig PCRE representation.
Dates can be represented in JSON using either the ISODate string or a timestamp. The output will be something like {$sec:556675,$usec:6787} and you can convert $sec to the display you need.
Binary data can be represented in JSON by taking the data property (if I remember right) from the output document, encoding it as base64, and storing it as a string in the field.
JSON makes no real distinction between int32 and int64, except that 64-bit ints will be bigger than 2147483647, so I am unsure whether you can keep the data types distinct there.
That should help get you started.
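One way to keep the types recoverable is to wrap each special value in a small tagged object. The sketch below is in Python rather than OCaml, the wrapper classes are made up for illustration, and the key names are loosely modeled on MongoDB's Extended JSON convention; treat the exact keys and the "00" subtype as assumptions, not a statement of what any driver emits.

```python
import base64
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Int64:
    """Hypothetical wrapper marking an integer as 64-bit."""
    value: int


@dataclass(frozen=True)
class Regex:
    """Hypothetical wrapper for a pattern plus its option flags."""
    pattern: str
    options: str


def to_tagged_json(value):
    """Replace special values with tagged objects that JSON can carry."""
    if isinstance(value, Int64):
        # A string avoids precision loss in JSON parsers that use doubles.
        return {"$numberLong": str(value.value)}
    if isinstance(value, Regex):
        return {"$regex": value.pattern, "$options": value.options}
    if isinstance(value, datetime):
        # Milliseconds since the Unix epoch (expects an aware datetime).
        return {"$date": (value - EPOCH) // timedelta(milliseconds=1)}
    if isinstance(value, bytes):
        encoded = base64.b64encode(value).decode("ascii")
        return {"$binary": encoded, "$type": "00"}  # assumed generic subtype
    if isinstance(value, dict):
        return {k: to_tagged_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_tagged_json(v) for v in value]
    return value  # plain ints stay untagged and are read back as int32


doc = {
    "big": Int64(5),
    "small": 5,
    "pattern": Regex("d", "ig"),
    "blob": b"\x00\x01",
    "when": datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc),
}
encoded = to_tagged_json(doc)
print(json.dumps(encoded))
```

With this scheme the reverse direction only has to look for the tag keys, so an Int64 holding a small number still round-trips as 64-bit.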