Split RDD of JSONs by size in Scala

Suppose we have a lot of JSONs in HDFS, but for a prototype we load some JSONs locally into Spark with:
val eachJson = sc.textFile("JSON_Folder/*.json")
I want to write a job which goes through the eachJson RDD[String] and calculates the size of each JSON. The size is then added to an accumulator and the corresponding JSON is appended to a StringBuilder. But when the size of the concatenated JSONs exceeds a threshold, we start storing the following JSONs in a new StringBuilder.
For instance, if we have 100 JSONs and we calculate their sizes one by one, we may observe that from the 32nd element the size of the concatenated JSONs exceeds the threshold; we then group together only the first 31 JSONs and start again from the 32nd element.
What I have managed to do so far is to obtain the indexes where the RDD has to be split, using the following code:
eachJson.collect()
  .map(_.getBytes("UTF-8").length)
  .scanLeft(0){_ + _}
  .takeWhile(_ < 20000) // threshold = 20000
  .length - 1
Also I tried:
val accum = sc.accumulator(0, "My Accumulator")
val buf = new StringBuilder
while (accum.value < 20000) {
  for (i <- eachJson) {
    accum.add(i.getBytes("UTF-8").length)
    buf ++= i
  }
}
But I receive the following error:
org.apache.spark.SparkException: Task not serializable.
How can I do this in Spark via Scala?
I use Spark 1.6.0 and Scala 2.10.6

Spark's programming model is not ideal for what you are trying to achieve, if we take the general problem of "aggregating elements depending on something that can only be known by inspecting previous elements", for two reasons:
Spark does not, generally speaking, impose an ordering over the data (but it can).
Spark deals with data in partitions, and the sizes of the partitions are not usually (i.e. by default) dependent on the contents of the data; they are determined by a default partitioner whose role is to divide the data evenly into partitions.
So it's not really a question of whether it's possible (it is); it's rather a question of how much it costs (CPU / memory / time) for what it buys you.
A draft for an exact solution
If I were to shoot for an exact solution (by exact, I mean: preserving element order, defined by e.g. a timestamp in the JSONs, and grouping exactly consecutive inputs up to the largest amount that approaches the boundary), I would:
1. Impose an ordering on the RDD (there is a sortBy function, which does that): this is a full data shuffle, so it IS expensive.
2. Give each row an id after the sort (there is an RDD version of zipWithIndex which respects the ordering of the RDD, if it has one. There is also a faster DataFrame equivalent that creates monotonically increasing indexes, albeit non-consecutive ones).
3. Collect the fraction of the result that is necessary to calculate the size boundaries (the boundaries being the ids defined at step 2), pretty much as you did. This again is a full pass over the data.
4. Create a partitioner that respects these boundaries (i.e. make sure that all elements of a single group end up in the same partition), and apply this partitioner to the RDD obtained at step 2 (another full shuffle of the data). You now have partitions that are logically equivalent to what you expect, i.e. groups of elements whose sum of sizes is under a certain limit. But the ordering inside each partition may have been lost in the repartitioning process, so you are not done yet!
5. Then I would mapPartitions on this result to:
5.1. re-sort the data locally within each partition,
5.2. group items into the data structure I need once sorted.
One of the keys is not to apply anything that messes with partitions between steps 4 and 5.
As long as the "partition map" fits into the driver's memory, this is almost a practical solution, but a very costly one. A sketch of these steps is given below.
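A minimal sketch of these steps, assuming the JSONs carry a sortable field (the extractTimestamp function below is purely hypothetical), that the threshold is the 20000 bytes from the question, and that the id-to-group map fits into driver memory:
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

val MAX_SIZE = 20000 // threshold from the question

// Steps 1 and 2: impose an ordering and attach stable ids
// extractTimestamp is hypothetical: any function String => something orderable
val indexed: RDD[(Long, String)] =
  eachJson.sortBy(extractTimestamp).zipWithIndex().map(_.swap)

// Step 3: collect the sizes on the driver and derive group boundaries
val sizes: Array[(Long, Int)] =
  indexed.mapValues(_.getBytes("UTF-8").length).collect().sortBy(_._1)

val groupOfId: Map[Long, Int] = {
  var group = 0
  var acc = 0
  sizes.map { case (id, size) =>
    if (acc + size > MAX_SIZE) { group += 1; acc = size } else acc += size
    id -> group
  }.toMap
}

// Step 4: a partitioner that sends every id of a group to the same partition
class GroupPartitioner(groups: Map[Long, Int]) extends Partitioner {
  override val numPartitions: Int = groups.values.max + 1
  override def getPartition(key: Any): Int = groups(key.asInstanceOf[Long])
}

// Step 5: re-sort locally and regroup inside each partition
// (each partition now holds exactly one group)
val grouped: RDD[Seq[String]] =
  indexed
    .partitionBy(new GroupPartitioner(groupOfId))
    .mapPartitions(it => Iterator(it.toSeq.sortBy(_._1).map(_._2)))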
A simpler version (with relaxed constraints)
If it is OK for groups not to reach an optimal size, then the solution becomes much simpler (and it respects the ordering of the RDD if you have set one): it is pretty much what you would code if there were no Spark at all, just an Iterator of JSON files.
Personally, I'd define a recursive accumulator function (nothing Spark-related) like so (I guess you could write a shorter, more efficient version using takeWhile):
import scala.annotation.tailrec

val MAX_SIZE = 20000 // threshold from the question

/**
 * Aggregate recursively the contents of an iterator into a Seq[Seq[]]
 * @param remainingJSONs the remaining original JSON contents to be aggregated
 * @param currentAccSize the size of the active accumulation
 * @param currentAcc the current aggregation of JSON strings
 * @param resultAccumulation the result of aggregated JSON strings
 */
@tailrec
def acc(remainingJSONs: Iterator[String], currentAccSize: Int, currentAcc: Seq[String], resultAccumulation: Seq[Seq[String]]): Seq[Seq[String]] = {
  // If there is nothing more in the current partition
  if (remainingJSONs.isEmpty) {
    // And we're not in the process of accumulating
    if (currentAccSize == 0)
      // Then return what was accumulated before
      resultAccumulation
    else
      // Return what was accumulated before, and what was in the process of being accumulated
      resultAccumulation :+ currentAcc
  } else {
    // We still have JSON items to process
    val itemToAggregate = remainingJSONs.next()
    // Is this item too large for the current accumulation?
    if (currentAccSize + itemToAggregate.size > MAX_SIZE) {
      // Finish the current aggregation, and proceed with a fresh one
      acc(remainingJSONs, itemToAggregate.size, Seq(itemToAggregate), resultAccumulation :+ currentAcc)
    } else {
      // Accumulate the current item on top of the current aggregation
      acc(remainingJSONs, currentAccSize + itemToAggregate.size, currentAcc :+ itemToAggregate, resultAccumulation)
    }
  }
}
Now you take this accumulating code and make it run for each partition of Spark's RDD:
val jsonRDD = ...
val groupedJSONs = jsonRDD.mapPartitions(aPartition => {
  acc(aPartition, 0, Seq(), Seq()).iterator
})
This will turn your RDD[String] into an RDD[Seq[String]], where each Seq[String] is made of consecutive RDD elements (which may be predictable if the RDD has been sorted, and may not be otherwise), whose total length is below the threshold.
What may be "sub-optimal" is that, at the end of each partition, there may be a Seq[String] with just a few (possibly a single one) JSONs, while the following partition starts a fresh, full one.

Not an answer; just to point you in the right direction. You get the "Task not serializable" exception because your val buf = new StringBuilder is used inside the RDD's foreach (for (i <- eachJson)). Spark cannot distribute your buf variable as StringBuilder itself is not serializable. Besides, you shouldn't access mutable state directly. So the recommendation is to put all the data you need into the accumulator, not just the sizes:
case class MyAccumulator(size: Int, result: String)
And use something like rdd.aggregate or rdd.fold:
eachJson.fold(MyAccumulator(0, ""))(...)
//or
eachJson.fold(List.empty[MyAccumulator])(...)
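For illustration, here is a hedged sketch of what the aggregation could look like with rdd.aggregate and a list of accumulators (aggregate rather than fold, because the element type String differs from the accumulator type; the 20000-byte threshold comes from the question):
case class MyAccumulator(size: Int, result: String)

val threshold = 20000

// Start a new accumulator whenever adding the next JSON would cross the threshold
val grouped: List[MyAccumulator] = eachJson.aggregate(List.empty[MyAccumulator])(
  (accs, json) => {
    val len = json.getBytes("UTF-8").length
    accs match {
      case head :: tail if head.size + len <= threshold =>
        MyAccumulator(head.size + len, head.result + json) :: tail
      case _ =>
        MyAccumulator(len, json) :: accs
    }
  },
  // Merging partial results from different partitions just concatenates the
  // group lists; groups at partition borders are not re-balanced
  (a, b) => a ++ b
)
Like the collect-based approach, this funnels all results back to the driver, so it only works while everything fits in memory.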
Or just use it with scanLeft as you collect anyway.
Be aware that this won't be scalable (the same problem as the StringBuilder/collect solution). In order to make it scalable, use mapPartitions.
Update. mapPartitions would give you the ability to partially aggregate your JSONs, as you would get a "local" iterator (partition) as your input - you can operate on it as a regular Scala collection. It might be enough if you are OK with some small percentage of JSONs not being concatenated.
eachJson.mapPartitions { localCollection =>
  ... //compression logic here
}
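A hedged sketch of that compression logic, again assuming the 20000-byte threshold from the question (each partition is folded locally into concatenated strings that stay under the threshold):
val threshold = 20000

val concatenated = eachJson.mapPartitions { localCollection =>
  // Build groups as (accumulated size, accumulated content), newest group first
  val groups = localCollection.foldLeft(List.empty[(Int, StringBuilder)]) { (acc, json) =>
    val len = json.getBytes("UTF-8").length
    acc match {
      case (size, buf) :: rest if size + len <= threshold =>
        (size + len, buf.append(json)) :: rest   // still fits: extend the current group
      case _ =>
        (len, new StringBuilder(json)) :: acc    // would overflow: start a new group
    }
  }
  // Restore the original group order and emit one concatenated String per group
  groups.reverse.iterator.map(_._2.toString)
}
This keeps everything distributed; only the last group of each partition may end up smaller than the threshold allows.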

Related

Read every nth batch in pyarrow.dataset.Dataset

In Pyarrow now you can do:
a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return only every Nth batch instead of every batch? Seems like this could be something in FragmentScanOptions, but that's not documented at all.
No, there is no way to do that today. I'm not sure what you're after but if you are trying to sample your data there are a few choices but none that achieve quite this effect.
To load only a fraction of your data from disk you can use pyarrow.dataset.head
There is a request in place for randomly sampling a dataset although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
sampled_fragments = all_fragments[::2]
# Have to construct the sample dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()

Convert List<dynamic> to List<String>

I am getting data from a server. Checking runtimeType shows that the data has type List.
Currently I am using cast<String>() to get a List<String>.
But is it the only/right way?
var value = await http.get('http://127.0.0.1:5001/regions');
if (value.statusCode == 200) {
  return jsonDecode(value.body)['data'].cast<String>();
}
There are multiple ways, depending on how soon you want an error if the list contains a non-string, and how you're going to use the list.
list.cast<String>() creates a lazy wrapper around the original list. It checks on each read that the value is actually a String. If you plan to read often, all that type checking might be expensive, and if you want an early error if the last element of the list is not a string, it won't do that for you.
List<String>.from(list) creates a new list of String and copies each element from list into the new list, checking along the way that it's actually a String. This approach errs early if a value isn't actually a string. After creation, there are no further type checks. On the other hand, creating a new list costs extra memory.
[for (var s in list) s as String],
[...list.cast<String>()],
<String>[for (var s in list) s], and
<String>[...list]
are all other ways to create a new list of strings. The last two rely on an implicit downcast from dynamic; the first two use explicit casts.
I recommend using list literals where possible. Here, I'd probably go for the smallest version <String>[...list], if you want a new list. Otherwise .cast<String>() is fine.

Dealing with JSON in Scala?

In Scala 2.11, having the below code:
import play.api.libs.json._
...
val data = // read json from file (3)
val JSON: JsValue = Json.parse(data mkString "\n") (4)
val items = JSON \ "items"
for (i <- 0 until 100) yield items(i)
if I unite the last two lines for (i <- 0 until 100) yield (JSON \ "items")(i), will the term JSON \ "items" be evaluated for each i or only once?
is it worth parallelising the list construction with this for-expression (I don't care about the order in which items will appear in the list), where items is an array of JSON objects?
what is the best way to handle exceptions from parsing the JSON in the lines (3 - 4) and validate it?
If you use the expression JSON \ "items" 100 times instead of once, there will be 100 times the work to find those nodes - there's no magic memoization or anything like that going on. Your cost is O(n) relative to the number of times you execute it, not O(1). But in any case, for this application the difference is inconsequential - assuming there's no outer loop you're not showing us.
This is way too small beans for parallelization to make any sense - in fact, the overhead might slow things down. If your real case was yield expensiveComputationBasedOn(items(i)), then maybe.
For lines 3-4, yes, use a Try here if you need to handle it here, else farther up (in the method that called the method that called this). In general, catch exceptions at the highest level where you can still provide sufficient information about what went wrong in a log message, where you can do any failure recovery, and where you'll be able to debug. This saves work and makes sure that you catch everything - even what you don't think of. If that's in your "main", fine. An Option will not catch exceptions. Caution: if this is for a class, your teacher may be looking for local error handling, regardless.
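For illustration, a minimal sketch of the Try approach around the Play JSON code from the question (where exactly you match on the result depends on which level has enough context to log or recover):
import scala.util.{Failure, Success, Try}
import play.api.libs.json._

// data is the file contents read at line (3) of the question
val parsed: Try[JsValue] = Try(Json.parse(data mkString "\n"))

parsed match {
  case Success(json) =>
    val items = json \ "items"            // evaluated once, then reused
    for (i <- 0 until 100) yield items(i)
  case Failure(e) =>
    // log / recover wherever you have enough information about what went wrong
    println(s"Could not parse JSON: ${e.getMessage}")
}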

Parallel deserialization of Json from a database

This is the scenario: in a separate task I read from a data reader which represents a single-column result set containing JSON strings. In that task I add each JSON string to a BlockingCollection that wraps a ConcurrentQueue. At the same time, in the main thread, I TryTake/dequeue a JSON string from the collection and then yield return it deserialized.
The reading from the database and the deserialization run at approximately the same speed, so there will not be too much memory consumption caused by a large BlockingCollection.
When the reading from the database is done, the task is closed and I then deserialize all the remaining non-deserialized JSON strings.
Questions/thoughts:
1) Does the TryTake lock so that no adding can be done?
2) Don't do it. Just do it in serial and yield return.
using (var q = new BlockingCollection<string>())
{
    Task task = null;
    try
    {
        task = new Task(() =>
        {
            foreach (var json in sourceData)
                q.Add(json);
        });
        task.Start();
        while (!task.IsCompleted)
        {
            string json;
            if (q.TryTake(out json))
                yield return Deserialize<T>(json);
        }
        Task.WaitAll(task);
    }
    finally
    {
        if (task != null)
        {
            task.Dispose();
        }
        q.CompleteAdding();
    }
    foreach (var e in q.GetConsumingEnumerable())
        yield return Deserialize<T>(e);
}
Question 1:
Does the TryTake lock so that no adding can be done?
There will be a very brief period during which an add cannot be performed; however, this time will be negligible. From http://msdn.microsoft.com/en-us/library/dd997305.aspx:
Some of the concurrent collection types use lightweight
synchronization mechanisms such as SpinLock, SpinWait, SemaphoreSlim,
and CountdownEvent, which are new in the .NET Framework 4. These
synchronization types typically use busy spinning for brief periods
before they put the thread into a true Wait state. When wait times are
expected to be very short, spinning is far less computationally
expensive than waiting, which involves an expensive kernel transition.
For collection classes that use spinning, this efficiency means that
multiple threads can add and remove items at a very high rate. For
more information about spinning vs. blocking, see SpinLock and
SpinWait.
The ConcurrentQueue<T> and ConcurrentStack<T> classes do not use locks
at all. Instead, they rely on Interlocked operations to achieve
thread-safety.
Question 2:
Don't do it. Just do it in serial and yield return.
This seems like the way to go. As with any optimisation work, do what is simplest and then measure! If there is a bottleneck here, consider optimising, but at least you'll know whether your 'optimisations' are actually helping, by virtue of having metrics to compare against.

Synonym dictionary implementation?

How should I approach this problem? I basically need to implement a dictionary of synonyms. It takes as input some "word/synonym" pairs and I have to be able to "query" it for the list of all synonyms of a word.
For example:
Dictionary myDic;
myDic.Add("car", "automobile");
myDic.Add("car", "autovehicle");
myDic.Add("car", "vehicle");
myDic.Add("bike", "vehicle");
myDic.ListOSyns("car") // should return {"automobile","autovehicle","vehicle" ± "car"}
// but "bike" should NOT be among the words returned
I'll code this in C++, but I'm interested in an overall idea of the implementation, so the question is not exactly language-specific.
PS: The main idea is to have some groups of words (synonyms). In the example above there would be two such groups:
{"automobile","autovehicle","vehicle", "car"}
{"bike", "vehicle"}
"vehicle" belongs to both, "bike" just to the second one, the others just to the first
I would implement it as a graph plus a hash table / search tree:
Each keyword would be a vertex, and each connection between two keywords would be an edge.
A hash table or a search tree will map each word to its node (and vice versa).
When a query is submitted, you find the node via your hash/tree and do a BFS/DFS of the required depth (meaning you cannot continue after a certain depth).
Complexity: O(E(d) + V(d)) for searching the graph (d = depth, E(d) = number of edges within the relevant depth, same for V(d)).
O(1) for creating an edge (not including finding the node; that cost is detailed below).
O(log n) / O(1) for finding a node (for tree / hash table).
O(log n) / O(1) for adding a keyword to the tree / hash table, and O(1) to add a vertex.
P.S. As mentioned in the comments to the question, the designer should keep in mind whether a directed or undirected graph is needed.
Hope that helps...
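Since the question says the overall idea matters more than the language, here is a minimal sketch of this graph-plus-map idea (written in Scala for brevity; the class and method names are illustrative): an adjacency map plays the role of the hash table, and a depth-bounded traversal collects the synonyms.
import scala.collection.mutable

class SynonymGraph {
  // word -> set of directly connected words (undirected edges)
  private val adjacency = mutable.Map.empty[String, Set[String]].withDefaultValue(Set.empty)

  def add(word: String, synonym: String): Unit = {
    adjacency(word) += synonym
    adjacency(synonym) += word
  }

  // All words reachable from `word` within `depth` hops, excluding `word` itself
  def listOSyns(word: String, depth: Int = 1): Set[String] = {
    @annotation.tailrec
    def walk(frontier: Set[String], seen: Set[String], remaining: Int): Set[String] =
      if (remaining == 0 || frontier.isEmpty) seen
      else {
        val next = frontier.flatMap(adjacency) -- seen - word
        walk(next, seen ++ next, remaining - 1)
      }
    walk(Set(word), Set.empty, depth)
  }
}

val myDic = new SynonymGraph
myDic.add("car", "automobile")
myDic.add("car", "autovehicle")
myDic.add("car", "vehicle")
myDic.add("bike", "vehicle")
myDic.listOSyns("car")  // Set(automobile, autovehicle, vehicle); "bike" only appears at depth >= 2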
With the clarification in the comments to the question, it's relatively simple since you're not storing groups of mutual synonyms, but rather separately defining the acceptable synonyms for each word. The obvious container is either:
std::map<std::string, std::set<std::string> >
or:
std::multimap<std::string, std::string>
if you're not worried about duplicates being inserted, like this:
myDic.Add("car", "automobile");
myDic.Add("car", "auto");
myDic.Add("car", "automobile");
In the case of multimap, use the equal_range member function to extract the synonyms for each word, maybe like this:
struct Dictionary {
  multimap<string, string> innermap;

  void Add(const string &key, const string &synonym) {
    innermap.insert(make_pair(key, synonym));
  }

  vector<string> ListOSyns(const string &key) const {
    typedef multimap<string, string>::const_iterator constit;
    pair<constit, constit> x = innermap.equal_range(key);
    vector<string> retval;
    for (constit it = x.first; it != x.second; ++it)
      retval.push_back(it->second); // copy only the mapped synonym, not the key/value pair
    retval.push_back(key);
    return retval;
  }
};
Finally, if you prefer a hashtable-like structure to a tree-like structure, then unordered_multimap might be available in your C++ implementation, and basically the same code works.