Dealing with JSON in Scala?

In Scala 2.11, I have the code below:
import play.api.libs.json._
...
val data = // read json from file (3)
val JSON: JsValue = Json.parse(data mkString "\n") (4)
val items = JSON \ "items"
for (i <- 0 until 100) yield items(i)
If I combine the last two lines into for (i <- 0 until 100) yield (JSON \ "items")(i), will the term JSON \ "items" be evaluated for each i, or only once?
Is it worth parallelising the list construction with this for-expression (I don't care about the order in which the items appear in the list), where items is an array of JSON objects?
What is the best way to handle exceptions from parsing the JSON in lines (3)-(4), and to validate it?

If you evaluate the expression JSON \ "items" 100 times instead of once, there'll be 100 times the work to find those nodes - there's no magic memoization or anything like that going on. Your cost is O(n) in the number of times you execute it - not O(1). But in any case, for this application the difference is inconsequential - assuming there's no outer loop you're not showing us.
This is way too small beans for parallelization to make any sense - in fact, the overhead might slow things down. If your real case was yield expensiveComputationBasedOn(items(i)), then maybe.
For lines 3-4, yes, use a Try here if you need to handle it here, else farther up (in the method that called the method that called this). In general, catch exceptions at the highest level where you can still provide sufficient information about what went wrong in a log message, where you can do any failure recovery, and where you'll be able to debug. This saves work and makes sure that you catch everything - even what you don't think of. If that's in your "main", fine. An Option will not catch exceptions. Caution: if this is for a class, your teacher may be looking for local error handling, regardless.
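A minimal sketch of that Try approach (assuming the file in line (3) is read with scala.io.Source from a hypothetical path "data.json"):
import scala.io.Source
import scala.util.{Failure, Success, Try}
import play.api.libs.json._

val parsed: Try[JsValue] = for {
  text <- Try(Source.fromFile("data.json").mkString) // reading the file can fail
  json <- Try(Json.parse(text))                      // parsing can fail too
} yield json

parsed match {
  case Success(json) => println(json \ "items")      // carry on with the parsed value
  case Failure(e)    => println(s"Could not read or parse JSON: ${e.getMessage}")
}
Recovering, logging, or rethrowing in the Failure case is then a decision for whatever level of the program can act on it.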


Split RDD of JSON-s by size in Scala

Suppose we have a lot of JSON-s in HDFS, but for a prototype we load some JSON-s locally into Spark with:
val eachJson = sc.textFile("JSON_Folder/*.json")
I want to write a job which goes through the eachJson RDD[String] and calculates the size of each JSON. The size is then added to an accumulator and the corresponding JSON is appended to a StringBuilder. But when the size of the concatenated JSON-s exceeds a threshold, we start to store the following JSON-s in a new StringBuilder.
For instance, if we have 100 JSON-s and we calculate their sizes one by one, we may observe that at the 32nd element the size of the concatenated JSON-s exceeds the threshold, so we group together only the first 31 JSON-s. After that we start again from the 32nd element.
What I managed to do until now is to obtain the indexes where we have to split the RDD based on the following code:
eachJson.collect()
  .map(_.getBytes("UTF-8").length)
  .scanLeft(0){_ + _}
  .takeWhile(_ < 20000) // threshold = 20000
  .length - 1
Also I tried:
val accum = sc.accumulator(0, "My Accumulator")
val buf = new StringBuilder
while (accum.value < 20000)
{
  for (i <- eachJson)
  {
    accum.add(i.getBytes("UTF-8").length)
    buf ++= i
  }
}
But I receive the following error:
org.apache.spark.SparkException: Task not serializable.
How can I do this in Spark via Scala?
I use Spark 1.6.0 and Scala 2.10.6
Spark's programming model is not ideal for what you are trying to achieve, if we take the general problem of "aggregating elements depending on something that can only be known by inspecting previous elements", for two reasons:
Spark does not, generally speaking, impose an ordering over the data (although it can)
Spark deals with data in partitions, and the sizes of the partitions are not usually (e.g. by default) dependent on the contents of the data; they are decided by a default partitioner whose role is to divide the data evenly into partitions.
So it's not really a question of whether it is possible (it is); it's rather a question of how much it costs (CPU / memory / time) for what it buys you.
A draft for an exact solution
If I were to shoot for an exact solution (by exact, I mean: preserving the element order, defined by e.g. a timestamp in the JSONs, and grouping exactly consecutive inputs into the largest groups that approach the boundary), I would:
Impose an ordering on the RDD (there is a sortBy function, which does that): this is a full data shuffle, so it IS expensive.
Give each row an id after the sort (there is an RDD version of zipWithIndex which respects the ordering of the RDD, if there is one. There is also a faster dataframe equivalent that creates monotonically increasing indexes, albeit non-consecutive ones).
Collect the fraction of the result that is necessary to calculate the size boundaries (the boundaries being the ids defined at step 2), pretty much as you did. This again is a full pass over the data.
Create a partitioner of the data that respects these boundaries (e.g. make sure that all elements of a single group are in the same partition), and apply this partitioner to the RDD obtained at step 2 (another full shuffle of the data). You just got yourself partitions that are logically equivalent to what you expect, i.e. groups of elements whose sum of sizes is under a certain limit. But the ordering inside each partition may have been lost in the repartitioning process. So you are not done yet!
Then I would mapPartitions on this result to:
5.1. re-sort the data locally within each partition,
5.2. group the items into the data structure I need, once sorted.
One of the keys is not to apply anything that messes with the partitions between steps 4 and 5.
As long as the "partition map" fits into the driver's memory, this is almost a practical solution, but a very costly one.
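Building on the question's eachJson, here is a very rough sketch of steps 1 to 3 only (the boundary-respecting partitioner of step 4 and the per-partition re-sort of step 5 are omitted); orderingKey is a hypothetical stand-in for whatever field defines the ordering, e.g. a timestamp inside each JSON:
// Hypothetical: extract the field that defines the ordering (e.g. a timestamp).
def orderingKey(json: String): Long = json.hashCode.toLong // stand-in only

val indexed = eachJson
  .sortBy(orderingKey)                   // step 1: impose an ordering (full shuffle)
  .zipWithIndex()                        // step 2: ids that respect that ordering
  .map { case (json, id) => (id, json) }

// Step 3: only (id, size) pairs travel to the driver; the boundary ids can then
// be derived there with a scanLeft over the sizes, much like in the question.
val idsAndSizes = indexed
  .mapValues(_.getBytes("UTF-8").length)
  .collect()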
A simpler version (with relaxed constraints)
If it is ok for groups not to reach an optimal size, then the solution becomes much simpler (and it respects the ordering of the RDD if you have set one): it is pretty much what you would code if there were no Spark at all, just an Iterator of JSON files.
Personally, I'd define a recursive accumulator function (nothing Spark related) like so (I guess you could write a shorter, more efficient version using takeWhile):
import scala.annotation.tailrec // needed for the @tailrec annotation below

val MAX_SIZE = 20000 // the threshold, as in the question

/**
 * Aggregate recursively the contents of an iterator into a Seq[Seq[String]]
 * @param remainingJSONs the remaining original JSON contents to be aggregated
 * @param currentAccSize the size of the active accumulation
 * @param currentAcc the current aggregation of JSON strings
 * @param resultAccumulation the result of aggregated JSON strings
 */
@tailrec
def acc(remainingJSONs: Iterator[String], currentAccSize: Int, currentAcc: Seq[String], resultAccumulation: Seq[Seq[String]]): Seq[Seq[String]] = {
  // If there is nothing more in the current partition
  if (remainingJSONs.isEmpty) {
    // And we're not in the process of accumulating
    if (currentAccSize == 0)
      // Then return what was accumulated before
      resultAccumulation
    else
      // Return what was accumulated before, and what was in the process of being accumulated
      resultAccumulation :+ currentAcc
  } else {
    // We still have JSON items to process
    val itemToAggregate = remainingJSONs.next()
    // Is this item too large for the current accumulation?
    if (currentAccSize + itemToAggregate.size > MAX_SIZE) {
      // Finish the current aggregation, and proceed with a fresh one
      acc(remainingJSONs, itemToAggregate.size, Seq(itemToAggregate), resultAccumulation :+ currentAcc)
    } else {
      // Accumulate the current item on top of the current aggregation
      acc(remainingJSONs, currentAccSize + itemToAggregate.size, currentAcc :+ itemToAggregate, resultAccumulation)
    }
  }
}
Now you take this accumulating code, and make it run on each partition of the Spark RDD:
val jsonRDD = ...
val groupedJSONs = jsonRDD.mapPartitions(aPartition => {
  acc(aPartition, 0, Seq(), Seq()).iterator
})
This will turn your RDD[String] into a RDD[Seq[String]], where each Seq[String] is made of consecutive RDD elements (which is predictable if the RDD has been sorted, and may not be otherwise), and whose total length is below the threshold.
What may be "sub-optimal" is that, at the end of each partition, there may be a Seq[String] with just a few (possibly a single) JSONs, while the following partition starts a full new group of its own.
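If what you ultimately need is one concatenated String per group (as with the StringBuilder in the question), a possible final step over this result (the newline separator is just an assumption):
val concatenatedGroups = groupedJSONs.map(_.mkString("\n")) // one String per group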
Not an answer; just to point you in the right direction. You get the "Task is not serializable" exception because your val buf = new StringBuilder is used inside the RDD's foreach (for (i <- eachJson)). Spark cannot distribute your buf variable, as StringBuilder itself is not serializable. Besides, you shouldn't access mutable state directly. So the recommendation is to put all the data you need into the Accumulator, not just the sizes:
case class MyAccumulator(size: Int, result: String)
And use something like rdd.aggregate or rdd.fold:
eachJson.fold(MyAccumulator(0, ""))(...)
//or
eachJson.fold(List.empty[MyAccumulator])(...)
Or just use it with scanLeft as you collect anyway.
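Using the MyAccumulator above, a minimal sketch of the aggregate variant; unlike fold, aggregate lets the accumulator type differ from the RDD's element type (String here). Note that this still concatenates everything into a single value:
val combined = eachJson.aggregate(MyAccumulator(0, ""))(
  (acc, json)   => MyAccumulator(acc.size + json.getBytes("UTF-8").length, acc.result + json),
  (left, right) => MyAccumulator(left.size + right.size, left.result + right.result)
)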
Be aware that this won't be scalable (same as the StringBuilder/collect solution). In order to make it scalable, use mapPartitions.
Update. mapPartitions would give you the ability to partially aggregate your JSONs, as you get a "local" iterator (partition) as your input - you can operate on it as a regular Scala collection. It might be enough if you are ok with some small percentage of JSONs not being concatenated.
eachJson.mapPartitions{ localCollection =>
... //compression logic here
}

How does Ruby JSON.parse differ to OJ.load in terms of allocating memory/Object IDs

This is my first question and I have tried my best to find an answer - I have looked everywhere, but haven't managed to find anything concrete in the Oj docs, the Ruby JSON docs, or here.
Oj is a gem that serves to improve serialization/deserialization speeds and can be found at: https://github.com/ohler55/oj
I noticed this difference when I tried to dump and parse a hash with a NaN contained in it, twice, and compared the two, i.e.
# Create json Dump
dump = JSON.dump ({x: Float::NAN})
# Create first JSON load
json_load = JSON.parse(dump, allow_nan: true)
# Create second JSON load
json_load_2 = JSON.parse(dump, allow_nan: true)
# Create first OJ load
oj_load = Oj.load(dump, :mode => :compat)
# Create second Oj load
oj_load_2 = Oj.load(dump, :mode => :compat)
json_load == json_load_2 # Returns true
oj_load == oj_load_2 # Returns false
I always thought NaN could not be compared to NaN so this confused me for a while until I realised that json_load and json_load_2 have the same object ID and oj_load and oj_load_2 do not.
Can anyone point me in the direction of where this memory allocation/object ID allocation occurs or how I can control that behaviour with OJ?
Thanks and sorry if this answer is floating somewhere on the internet where I could not find it.
Additional info:
I am running Ruby 1.9.3.
Here's the output from my tests re object IDs:
puts Float::NAN.object_id; puts JSON.parse(%q({"x":NaN}), allow_nan: true)["x"].object_id; puts JSON.parse(%q({"x":NaN}), allow_nan: true)["x"].object_id
70129392082680
70129387898880
70129387898880
puts Float::NAN.object_id; puts Oj.load(%q({"x":NaN}), allow_nan: true)["x"].object_id; puts Oj.load(%q({"x":NaN}), allow_nan: true)["x"].object_id
70255410134280
70255410063100
70255410062620
Perhaps I am doing something wrong?
I believe that is a deep implementation detail. Oj does this:
if (ni->nan) {
    rnum = rb_float_new(0.0/0.0);
}
I can't find a Ruby-level equivalent for that (Float.new doesn't appear to exist), but it does create a new Float object every time, from an actual C NaN constructed on the spot, hence the different object_ids.
Whereas Ruby's JSON module uses (also in C) its own JSON::NaN Float object everywhere:
CNaN = rb_const_get(mJSON, rb_intern("NaN"));
That explains why you get different NaNs' object_ids with Oj and same with Ruby's JSON.
No matter what object_ids the resulting hashes have, the problem is with NaNs. If they have the same object_ids, the enclosing hashes are considered equal. If not, they are not.
According to the docs, Hash#== uses Object#== for values, which returns true only if the argument is the same object (same object_id). This contradicts NaN's property of not being equal to itself.
Spectacular. Inheritance gone haywire.
One could, probably, modify Oj's C code (and even make a pull request with it) to use a constant like Ruby's JSON module does. It's a subtle change, but it's in the spirit of being compat, I guess.

How to best validate JSON on the server-side

When handling POST, PUT, and PATCH requests on the server-side, we often need to process some JSON to perform the requests.
It is obvious that we need to validate these JSONs (e.g. structure, permitted/expected keys, and value types) in some way, and I can see at least two ways:
Upon receiving the JSON, validate the JSON upfront as it is, before doing anything with it to complete the request.
Take the JSON as it is, start processing it (e.g. access its various key-values) and try to validate it on the go while performing the business logic, possibly using some exception handling to deal with bad data.
The 1st approach seems more robust compared to the 2nd, but probably more expensive (in time cost), because every request will be validated (and hopefully most of them are valid, so the validation is somewhat redundant).
The 2nd approach may save the compulsory validation on valid requests, but mixing the checks with business logic might be buggy or even risky.
Which of the two above is better? Or, is there yet a better way?
What you are describing with POST, PUT, and PATCH sounds like you are implementing a REST API. Depending on your back-end platform, you can use libraries that map JSON to objects, which is very powerful and performs that validation for you. In Java, you can use Jersey, Spring, or Jackson. If you are using .NET, you can use Json.NET.
If efficiency is your goal and you want to validate every single request, it would be ideal if you could validate on the front-end; if you are using JavaScript you can use json2.js.
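To make the JSON-to-object mapping idea concrete in Scala (the language of the first question above) with Play JSON, here is a minimal sketch; the Item case class and its fields are invented for the example:
import scala.util.{Failure, Success, Try}
import play.api.libs.json._

// Hypothetical payload shape, purely for illustration.
case class Item(name: String, quantity: Int)
implicit val itemReads: Reads[Item] = Json.reads[Item]

// Returns the typed object, or a description of what was wrong with the request.
def validateBody(body: String): Either[String, Item] =
  Try(Json.parse(body)).map(_.validate[Item]) match {
    case Success(JsSuccess(item, _)) => Right(item)           // structure and types already checked
    case Success(JsError(errors))    => Left(errors.toString) // reject the request up front
    case Failure(e)                  => Left("Malformed JSON: " + e.getMessage)
  }
Everything after this point can work with a well-typed Item instead of raw JSON.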
In regards to comparing your methods, here is a Pro / Cons list.
Method #1: Upon Request
Pros
The business logic integrity is maintained. As you mentioned, trying to validate while processing business logic could flag valid data as invalid (and vice versa), and the validation could inadvertently impact the business logic negatively.
As Norbert mentioned, catching the errors beforehand will improve efficiency. The logical question this poses is: why spend the time processing if the data has errors in the first place?
The code will be cleaner and easier to read. Having validation and business logic separated will result in cleaner, easier to read and maintain code.
Cons
It could result in redundant processing meaning longer computing time.
Method #2: Validation on the Go
Pros
It's theoretically efficient: it saves processing and compute time by doing both at once.
Cons
In reality, the processing time saved is likely negligible (as mentioned by Norbert). You are still doing the validation check either way. In addition, processing time is wasted if an error is found.
The data integrity can be compromised. It is possible that the JSON becomes corrupt when processing it this way.
The code is not as clear. When reading the business logic, it may not be as apparent what is happening, because validation logic is mixed in.
What it really boils down to is accuracy vs speed. They generally have an inverse relationship. As you become more accurate and validate your JSON, you may have to compromise some on speed. This is really only noticeable on large data sets, as computers are really fast these days. It is up to you to decide what is more important, given how accurate you think your data may be when receiving it, and whether that extra second or so is crucial. In some cases it does matter (e.g. with stock market and healthcare applications, milliseconds matter) and both are highly important. It is in those cases that, as you increase one, for example accuracy, you may have to increase speed by getting a higher-performance machine.
Hope this helps.
The first approach is more robust, but does not have to be noticeably more expensive. It even becomes less expensive when you are able to abort the parsing process on errors: your business logic usually takes >90% of the resources in a process, so if you have an error rate of 10%, you are already resource neutral. If you optimize the validation process so that the validations from the business process are performed upfront, your error rate can be much lower (like 1 in 20 to 1 in 100) and you still stay resource neutral.
For an example on an implementation assuming upfront data validation, look at GSON (https://code.google.com/p/google-gson/):
GSON works as follows: Every part of the JSON can be cast into an object. This object is typed or contains typed data:
Sample object (Java used as the example language):
public class someInnerDataFromJSON {
    String name;
    String address;
    int housenumber;
    String buildingType;

    // Getters and setters
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    // etc.
}
The data parsed by GSON, using the model provided, is already type checked.
This is the first point where your code can abort.
After this exit point, assuming the data conformed to the model, you can validate whether the data is within certain limits. You can also write that into the model.
Assume for this example that buildingType is a list:
Single family house
Multi family house
Apartment
You can check data during parsing by creating a setter which checks the data, or you can check it after parsing, in a first pass of your business rule application. The benefit of checking the data first is that your later code will need less exception handling, which means less code that is also easier to understand.
I would definitely go for validation before processing.
Let's say you receive some json data with 10 variables of which you expect:
the first 5 variables to be of type string
6 and 7 are supposed to be integers
8, 9 and 10 are supposed to be arrays
You can do a quick variable type validation before you start processing any of this data and return a validation error response if one of the ten fails.
foreach ($data as $varName => $varValue) {
    $varType = gettype($varValue);
    if (!$this->isTypeValid($varName, $varType)) {
        // return validation error
    }
}
// continue processing
Think of the scenario where you are directly processing the data and then the 10th value turns out to be of an invalid type. The processing of the previous 9 variables was a waste of resources, since you end up returning a validation error response anyway. On top of that, you have to roll back any changes already persisted to your storage.
I only use variable type in my example but I would suggest full validation (length, max/min values, etc) of all variables before processing any of them.
In general, the first option would be the way to go. The only reason you might need to consider the second option is if you were dealing with JSON data that is tens of MBs or larger.
In other words, only if you are trying to stream JSON and process it on the fly do you need to think about the second option.
Assuming that you are dealing with a few hundred KB at most per JSON, you can just go for option one.
Here are some steps you could follow:
Go for a JSON parser like GSON that would just convert your entire JSON input into the corresponding Java domain model object. (If GSON doesn't throw an exception, you can be sure that the JSON is perfectly valid.)
Of course, the objects constructed using GSON in step 1 may not be in a functionally valid state. For example, functional checks like mandatory fields and limit checks would have to be done. For this, you could define a validateState method which recursively validates the state of the object itself and its child objects.
Here is an example of a validateState method:
public void validateState() {
    // Assume this validateState is part of the Customer class.
    if (age < 12 || age > 150)
        throw new IllegalArgumentException("Age should be in the range 12 to 150");
    if (age < 18 && (guardianId == null || guardianId.trim().equals("")))
        throw new IllegalArgumentException("Guardian id is mandatory for minors");
    for (Account a : getAccounts()) {
        a.validateState(); // Throws appropriate exceptions if there is any inconsistency in state
    }
}
The answer depends entirely on your use case.
If you expect all calls to originate from trusted clients, then the upfront schema validation should be implemented so that it is activated only when you set a debug flag.
However, if your server delivers public api services then you should validate the calls upfront. This isn't just a performance issue - your server will likely be scrutinized for security vulnerabilities by your customers, hackers, rivals, etc.
If your server delivers private api services to non-trusted clients (e.g., in a closed network setup where it has to integrate with systems from 3rd party developers), then you should at least run upfront those checks that will save you from getting blamed for someone else's goofs.
It really depends on your requirements. But in general I'd always go for #1.
Few considerations:
For consistency I'd use method #1, for performance #2. However, when using #2 you have to take into account that rolling back in case of invalid input may become complicated in the future, as the logic changes.
JSON validation should not take that long. In Python you can use ujson for parsing JSON strings, which is an ultra-fast C implementation of the json Python module.
For validation, I use the jsonschema Python module, which makes JSON validation easy.
Another approach:
If you use jsonschema, you can validate the JSON request in steps. I'd perform an initial validation of the most common/important parts of the JSON structure, and validate the remaining parts along the business logic path. This allows you to write simpler and therefore more lightweight JSON schemas.
The final decision:
If (and only if) this decision is critical, I'd implement both solutions, time-profile them on correct and wrong input, and weight the results by the wrong-input frequency. Therefore:
1c = average time spent with method 1 on correct input
1w = average time spent with method 1 on wrong input
2c = average time spent with method 2 on correct input
2w = average time spent with method 2 on wrong input
CR = correct input rate (or frequency)
WR = wrong input rate (or frequency)
if (1c * CR) + (1w * WR) <= (2c * CR) + (2w * WR):
    choose method 1
else:
    choose method 2

Using JSON in Haskell to Serialize a Record

I've been working on a very small program to grab details about Half Life 2 servers (using the protocol-srcds library). The workflow is pretty straightforward; it takes a list of servers from a file, queries each of them, and writes the output out to another file (which is read by a PHP script for display, as I'm tied to vBulletin). Would be nice if it was done in SQL or something, but seeing as I'm still just learning, that's a step too far for now!
Anyway, my question relates to serialization, namely, serializing to JSON. For now, I've written a scrappy helper function jsonify, such that:
jsonify (Just (SRCDS.GameServerInfo serverVersion
                                    serverName
                                    serverMap
                                    serverMod
                                    serverModDesc
                                    serverAppId
                                    serverPlayers
                                    serverMaxPlayers
                                    serverBots
                                    serverType
                                    serverOS
                                    serverPassword
                                    serverSecure
                                    serverGameVersioning)) =
  toJSObject [ ("serverName", serverName)
             , ("serverMap", serverMap)
             , ("serverPlayers", show serverPlayers)
             , ("serverMaxPlayers", show serverMaxPlayers) ]
(I'm using the Text.JSON package). This is obviously not ideal. At this stage, however, I don't understand using instances to define serializers for records, and my attempts to do so met a wall of frustration in the type system.
Could someone please walk me through the "correct" way of doing this? How would I go about defining an instance that serializes the record? What functions should I use in the instance (showJSON?).
Thanks in advance for any help.
You might want to consider using Data.Aeson instead which might be regarded as the successor to Text.JSON.
With aeson you define separate instances for serializing and deserializing (with Text.JSON you have to define both directions even if you need only one, otherwise the compiler will complain - unless you silence the warning somehow), and it provides a few operators that make defining instances a bit more compact. For example, the code from @hammar's answer can be written a little less noisily with the aeson API:
instance ToJSON SRCDS.GameServerInfo where
  toJSON (SRCDS.GameServerInfo {..}) = object
    [ "serverName" .= serverName
    , "serverMap" .= serverMap
    , "serverPlayers" .= serverPlayers
    , "serverMaxPlayers" .= serverMaxPlayers
    ]
One simple thing you can do is use record wildcards to cut down on the pattern code.
As for your type system problems, it's hard to give help without seeing error messages and what you've tried so far, however I suspect one thing that might be confusing is that the result of toJSObject will have to be wrapped in a JSObject data constructor as the return type of showJSON is supposed to be a JSValue. Similarly, the values of your object should also be of type JSValue. The easiest way to do this is to use their JSON instance and call showJSON to convert the values.
instance JSON SRCDS.GameServerInfo where
  showJSON (SRCDS.GameServerInfo {..}) =
    JSObject $ toJSObject [ ("serverName", showJSON serverName)
                          , ("serverMap", showJSON serverMap)
                          , ("serverPlayers", showJSON serverPlayers)
                          , ("serverMaxPlayers", showJSON serverMaxPlayers) ]

Is it okay to rely on automatic pass-by-reference to mutate objects?

I'm working in Python here (which is actually pass-by-object-reference, I think), but the idea is language-agnostic as long as method parameters behave similarly:
If I have a function like this:
def changefoo(source, destination):
    destination["foo"] = source
    return destination
and call it like so,
some_dict = {"foo": "bar"}
some_var = "a"
new_dict = changefoo(some_var, some_dict)
new_dict will be a modified version of some_dict, but some_dict will also be modified.
Assuming the mutable structure, like the dict in my example, will almost always be similarly small, and that performance is not an issue (in my application I'm taking abstract objects and turning them into SOAP requests for different services, where the SOAP request will take an order of magnitude longer than reformatting the data for each service), is this okay?
The destination in these functions (there are several, it's not just a utility function like in my example) will always be mutable, but I like to be explicit: the return value of a function represents the outcome of a deterministic computation on the parameters you passed in. I don't like using out parameters but there's not really a way around this in Python when passing mutable structures to a function. A couple options I've mulled over:
Copying the parameters that will be mutated, to preserve the original
I'd have to copy the parameters in every function where I mutate them, which seems cumbersome and like I'm just duplicating a lot. Plus I don't think I'll ever actually need the original, it just seems messy to return a reference to the mutated object I already had.
Just use it as an in/out parameter
I don't like this, it's not immediately obvious what the function is doing, and I think it's ugly.
Create a decorator which will automatically copy the parameters
Seems like overkill
So is what I'm doing okay? I feel like I'm hiding something, and a future programmer might think the original object is preserved based on the way I'm calling the functions (grabbing the result rather than relying on the fact that it mutates the original). But I also feel like any of the alternatives will be messy. Is there a more preferred way? Note that it's not really an option to add a mutator-style method to the class representing the abstract data, due to the way the software works (I would have to add a method to translate that data structure into the corresponding SOAP structure for every service we send that data off to--currently the translation logic is in a separate package for each service).
If you have a lot of functions like this, I think your best bet is to write a little class that wraps the dict and modifies it in-place:
class DictMunger(object):
    def __init__(self, original_dict):
        self.original_dict = original_dict

    def changefoo(self, source):
        self.original_dict['foo'] = source

some_dict = {"foo": "bar"}
some_var = "a"

munger = DictMunger(some_dict)
munger.changefoo(some_var)
# ...
new_dict = munger.original_dict
Objects modifying themselves is generally expected and reads well.