Parallel deserialization of JSON from a database

This is the scenario: in a separate task I read from a DataReader that represents a single-column result set containing a JSON string per row. In that task I add each JSON string to a BlockingCollection that wraps a ConcurrentQueue. At the same time, in the main thread, I TryTake/dequeue a JSON string from the collection and then yield return it deserialized.
Reading from the database and deserializing run at roughly the same speed, so the BlockingCollection should not grow large and consume much memory.
When the reading from the database is done, the task completes and I then deserialize all of the remaining, not-yet-deserialized JSON strings.
Questions/thoughts:
1) Does the TryTake lock so that no adding can be done?
2) Don't do it. Just do it in serial and yield return.
using (var q = new BlockingCollection<string>())
{
    Task task = null;
    try
    {
        task = new Task(() =>
        {
            foreach (var json in sourceData)
                q.Add(json);
        });
        task.Start();

        while (!task.IsCompleted)
        {
            string json;
            if (q.TryTake(out json))
                yield return Deserialize<T>(json);
        }

        Task.WaitAll(task);
    }
    finally
    {
        if (task != null)
        {
            task.Dispose();
        }

        q.CompleteAdding();
    }

    foreach (var e in q.GetConsumingEnumerable())
        yield return Deserialize<T>(e);
}

Question 1
Does the TryTake lock so that no adding can be done?
There will be a very brief period during which an add cannot be performed; however, this time will be negligible. From http://msdn.microsoft.com/en-us/library/dd997305.aspx:
Some of the concurrent collection types use lightweight synchronization mechanisms such as SpinLock, SpinWait, SemaphoreSlim, and CountdownEvent, which are new in the .NET Framework 4. These synchronization types typically use busy spinning for brief periods before they put the thread into a true Wait state. When wait times are expected to be very short, spinning is far less computationally expensive than waiting, which involves an expensive kernel transition. For collection classes that use spinning, this efficiency means that multiple threads can add and remove items at a very high rate. For more information about spinning vs. blocking, see SpinLock and SpinWait.
The ConcurrentQueue and ConcurrentStack classes do not use locks at all. Instead, they rely on Interlocked operations to achieve thread-safety.
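As a rough illustration (a minimal sketch, not part of the original answer), a producer can keep calling Add while a consumer polls with TryTake; any contention between the two calls is too brief to matter in practice:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class TryTakeSketch
{
    static void Main()
    {
        // BlockingCollection wrapping a ConcurrentQueue, as in the question.
        using (var queue = new BlockingCollection<string>(new ConcurrentQueue<string>()))
        {
            // Producer task: keeps adding items while the consumer is taking them.
            var producer = Task.Run(() =>
            {
                for (int i = 0; i < 1000; i++)
                    queue.Add("{\"id\": " + i + "}");
                queue.CompleteAdding();
            });

            // Consumer: TryTake competes with Add only for very short periods.
            string json;
            while (!queue.IsCompleted)
            {
                if (queue.TryTake(out json, TimeSpan.FromMilliseconds(10)))
                    Console.WriteLine(json);
            }

            producer.Wait();
        }
    }
}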
Question 2:
Don't do it. Just do it in serial and yield return.
This seems like the way to go. As with any optimisation work: do what is simplest and then measure! If there is a bottleneck here, consider optimising, but at least you'll know whether your 'optimisations' are actually helping, by virtue of having metrics to compare against.
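For reference, a minimal sketch of the serial version (assuming the same sourceData reader and Deserialize<T> helper used in the question's code):

private static IEnumerable<T> ReadAndDeserialize<T>(IEnumerable<string> sourceData)
{
    // One row at a time: read the JSON string, deserialize it, hand it to the caller.
    // No queue, no extra task; memory holds only the current document.
    foreach (var json in sourceData)
        yield return Deserialize<T>(json);
}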

Related

GCP dataflow - processing JSON takes too long

I am trying to process JSON files in a bucket and write the results to a bucket:
DataflowPipelineOptions options = PipelineOptionsFactory.create()
    .as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);
options.setProject("the-project");
options.setStagingLocation("gs://some-bucket/temp/");

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
 .apply(ParDo.named("SanitizeJson").of(new DoFn<String, String>() {
     @Override
     public void processElement(ProcessContext c) {
         try {
             JsonFactory factory = JacksonFactory.getDefaultInstance();
             String json = c.element();
             SomeClass e = factory.fromString(json, SomeClass.class);
             // manipulate the object a bit...
             c.output(factory.toString(e));
         } catch (Exception err) {
             LOG.error("Failed to process element: " + c.element(), err);
         }
     }
 }))
 .apply(TextIO.Write.to("gs://some-bucket/output/"));

p.run();
I have around 50,000 files under the path gs://some-bucket/2016/04/28/ (in sub-directories).
My question is: does it make sense that this takes more than an hour to complete? Doing something similar on a Spark cluster in Amazon takes about 15-20 minutes. I suspect that I might be doing something inefficiently.
EDIT:
In my Spark job I aggregate all the results in a DataFrame and only then write the output, all at once. I noticed that my pipeline here writes each file separately; I assume that is why it's taking much longer. Is there a way to change this behavior?
Your jobs are hitting a couple of performance issues in Dataflow, caused by the fact that it is more optimized for executing work in larger increments, while your job is processing lots of very small files. As a result, some aspects of the job's execution end up dominated by per-file overhead. Here are some details and suggestions.
The job is limited by writing output rather than by reading input (though reading input is also a significant part). You can significantly cut that overhead by specifying withNumShards on your TextIO.Write, depending on how many files you want in the output; e.g. 100 could be a reasonable value. By default you get an unspecified number of files, which in this case, given the current behavior of the Dataflow optimizer, matches the number of input files. Usually that is a good idea because it allows us not to materialize the intermediate data, but in this case it is not, because the input files are so small and the per-file overhead dominates.
I recommend setting maxNumWorkers to a value like, e.g., 12 - currently the second job is autoscaling to an excessively large number of workers. This is caused by Dataflow's autoscaling currently being geared toward jobs that process data in larger increments; it does not yet take per-file overhead into account and so does not behave well in your case.
The second job is also hitting a bug that causes it to fail to finalize the written output. We're investigating; however, setting maxNumWorkers should also make it complete successfully.
In short:
set maxNumWorkers=12
set TextIO.Write.to("...").withNumShards(100)
and it should run much better.

Do Couchbase reactive clients guarantee order of rows in view query result

I use Couchbase Java SDK 2.2.6 with Couchbase server 4.1.
I query my view with the following code
public <T> List<T> findDocuments(ViewQuery query, String bucketAlias, Class<T> clazz) {
    // We specifically set reduce false and include docs to retrieve docs
    query.reduce(false).includeDocs();
    log.debug("Find all documents, query = {}", decode(query));
    return getBucket(bucketAlias)
            .query(query)
            .allRows()
            .stream()
            .map(row -> fromJsonDocument(row.document(), clazz))
            .collect(Collectors.toList());
}

private static <A> A fromJsonDocument(JsonDocument saved, Class<A> clazz) {
    log.debug("Retrieved json document -> {}", saved);
    A object = fromJson(saved.content(), clazz);
    return object;
}
In the logs from the fromJsonDocument method I see that rows are not always sorted by the row key. Usually they are, but sometimes they are not.
If I just run this query in the Couchbase browser GUI, I always receive results in the expected order. Is it a bug, or is it expected that view query results are not sorted when queried with the async client?
What is the behaviour in other (non-Java) clients?
This is due to the asynchronous nature of your call in the Java client, plus the fact that you used includeDocs.
What includeDocs does is weave in a call to get for each document ID received from the view. So when you look at the asynchronous sequence of AsyncViewRow with includeDocs, you're actually looking at a composition of a row returned by the view and an asynchronous retrieval of the whole document.
If one document retrieval has a little more latency than the retrieval for the previous row, it can reorder the (row + doc) emissions.
But good news, everyone! There is an includeDocsOrdered alternative on the ViewQuery that takes exactly the same parameters as includeDocs but ensures that the AsyncViewRows come back in the same order returned by the view.
This is done by eagerly triggering the get retrievals but then buffering those that arrive out of order, so as to maintain the original order without sacrificing too much performance.
That is quite specific to the Java client, with its usage of RxJava. I'm not even sure other clients have the notion of includeDocs...

Perl, Children, and Shared Data

I'm working with a database that holds lots of URLs (tens of thousands). I'm attempting to multi-thread a resolver that simply tries to resolve a given domain. On success, it compares the result to what's currently in the database. If it's different, the result is updated. If resolution fails, the record is also updated.
Naturally, this will produce an inordinate volume of database calls. To clear up some of my confusion about the best way to achieve some form of asynchronous load distribution, I have the following questions (being fairly new to Perl still).
What is the best option for distributing the workload? Why?
How should I gather the URLs to resolve prior to spawning?
Creating a hash of domains with the data to be compared seems to make the most sense to me. Then split it up, fire up the children, and have the children return the changes to be made to the parent.
How should returning data to the parent be handled in a clean manner?
I've been playing with a more Pythonic method (given that I have more experience in Python), but have yet to make it work, due to a lack of blocking for some reason. Aside from that issue, threading isn't the best option anyway, simply due to the (lack of) CPU time for each thread (plus, I've been crucified more than once in the Perl channel for using threads :P, and for good reason).
Below is more or less pseudo-code that I've been playing with for my threads (which should be taken more as a supplement to my explanation of what I'm trying to accomplish than anything else).
# Create children...
for (my $i = 0; $i < $threads_to_spawn; $i++)
{
    threads->create(\&worker);
}
The parent then sits in a loop, monitoring a shared array of domains. It locks and re-populates it if it becomes empty.
Your code is the start of a persistent worker model.
use threads;
use Thread::Queue 1.03 qw( );

use constant NUM_WORKERS => 5;

sub work {
    my ($dbh, $job) = @_;
    ...
}

{
    my $q = Thread::Queue->new();

    for (1..NUM_WORKERS) {
        async {
            my $dbh = ...;
            while (my $job = $q->dequeue()) {
                work($dbh, $job);
            }
        };
    }

    for my $job (...) {
        $q->enqueue($job);
    }

    $q->end();

    $_->join() for threads->list();
}
Performance tips:
Tweak the number of workers for your system and workload.
Grouping small jobs into larger jobs can improve speed by reducing overhead.

Deserializing unreliable JSON efficiently

I am trying to deserialize JSON from the server which sometimes omits certain keys/values, which prevented me from simply using:
var obj = JsonConvert.DeserializeObject<Obj>(respString);
I read up on the Json.NET documentation and went ahead with its LINQ-to-JSON deserialization, which looks like this:
string respString = await resp.Content.ReadAsStringAsync();
JObject json = JObject.Parse(respString);
...
CustomObject.child.primary_key = int.Parse(obj[i]["primary_key"].ToString());
CustomObject.foreign_key = obj[i]["foreign_key"].ToString();
CustomObject.properties = int.Parse(obj[i]["properties"].ToString());
try // this is done a few times throughout the loop
{
    CustomObject.unreliableProperty = obj[i]["xxx"].ToString();
}
catch { }
Everything works fine, but I occasionally get OutOfMemoryExceptions (the JSON from the server is quite big). I've tried implementing a custom JsonConverter from Json.NET, but with the nested hierarchy (3 levels deep) the JsonConverter gets too complicated and hard to understand, not to mention maintain.
When I run the memory profiler, I see peaks above 150 MB, which cause the application to throw OOM exceptions. Under the heap section, I see that 30-40% of the memory allocations are done by mscorlib due to string allocation. Another problem is that during the loop there is a continuous streak of GC going on. I'm utterly disgusted by the way it's written now, but I've reached my wit's end. Can anyone help?
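There is no answer attached to this question here, but for context, one way to cut the string allocations described above is to let Json.NET read directly from the response stream instead of building one large string first. This is only a hedged sketch: ResponseType is a hypothetical placeholder for the poster's own classes, and properties whose keys are missing from the JSON are simply left at their default values.

// Sketch only: deserialize straight from the HTTP stream (inside an async
// method, as in the question's code). ResponseType stands in for the
// poster's own object model; missing keys leave properties at defaults.
using (var stream = await resp.Content.ReadAsStreamAsync())
using (var streamReader = new StreamReader(stream))
using (var jsonReader = new JsonTextReader(streamReader))
{
    var serializer = new JsonSerializer();
    ResponseType result = serializer.Deserialize<ResponseType>(jsonReader);
    // work with result...
}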

Is LINQ lazy-evaluated?

Greetings, I have the following question. I did not find an exact answer for it, and it's really interesting to me. Suppose I have the following code that retrieves records from the database (in order to export them to an XML file, for example).
var result = from emps in dc.Employees
             where emps.age > 21
             select emps;

foreach (var emp in result) {
    // Append this record in suitable format to the end of XML file
}
Suppose there are a million records that satisfy the where condition in the code. What will happen? Will all this data be retrieved from SQL Server into runtime memory immediately when execution reaches the foreach construct, or will it be retrieved as necessary - the first record, then the second, and so on? In other words, does LINQ really handle the situation of iterating through large collections (see my post here for details)?
If not, how do I overcome the memory issues in that case? If I really need to traverse a large collection, what should I do? Calculate the actual number of elements in the collection with the help of the Count function, and after that read the data from the database in small portions? Is there an easy way to implement paging with the LINQ framework?
All the data will be retrieved from SQL Server, at one time, and put into memory. The only way around this that I can think of is to process the data in smaller chunks (paging using Skip() and Take()). But, of course, this requires more hits to SQL Server.
Here is a LINQ paging extension method I wrote to do this:
public static IEnumerable<TSource> Page<TSource>(this IEnumerable<TSource> source, int pageNumber, int pageSize)
{
    return source.Skip((pageNumber - 1) * pageSize).Take(pageSize);
}
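For illustration, usage against the question's query might look like the sketch below. Note that because this extension is declared on IEnumerable<TSource>, the Skip/Take run in memory once the query is enumerated; if you want the paging translated into SQL, a corresponding IQueryable<TSource> overload would be needed.

// Sketch: fetch and process the third "page" of 1000 matching employees.
var result = from emps in dc.Employees
             where emps.age > 21
             select emps;

foreach (var emp in result.Page(3, 1000))
{
    // Append this record in suitable format to the end of XML file
}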
Yes, LINQ uses lazy evaluation. The database would be queried when the foreach starts to execute, but it would fetch all the data in one go (it would be much less efficient to do millions of queries for just one result at a time).
If you're worried about bringing in too many results in one go, you can use Skip and Take to get only a limited number of results at a time (thus paginating your results).
It'll be retrieved when you invoke ToList or similar methods. LINQ has deferred execution:
http://weblogs.asp.net/psteele/archive/2008/04/18/linq-deferred-execution.aspx
Even with deferred execution, whether the entire collection is loaded from the data source (in the case of an OR/M or any other LINQ provider) is determined by the implementer of the LINQ object source.
For example, some OR/Ms provide lazy loading, which means your "entire list of employees" behaves more like a cursor: accessing one of the items (an employee), or even a single property, loads only that employee or only the accessed property.
But, anyway, these are the basics.
EDIT: Now I see it's a LINQ-to-SQL thing... or perhaps the question's author misunderstood LINQ and doesn't realize that LINQ isn't LINQ-to-SQL; LINQ is more a pattern and a language feature.
OK, now thanks to this answer I have an idea - how about combining the page-taking function with yield return? Here is a sample of the code:
// This is the original function that takes the page
public static IEnumerable<TSource> Page<TSource>(this IEnumerable<TSource> source, int pageNumber, int pageSize)
{
    return source.Skip((pageNumber - 1) * pageSize).Take(pageSize);
}

// And here is the function with the yield implementation
public static IEnumerable<TSource> Lazy<TSource>(this IEnumerable<TSource> source, int pageSize)
{
    int pageNumber = 1;
    int count;
    do
    {
        // Materialize one page at a time so it is only enumerated once,
        // then hand its items to the caller one by one.
        var page = Page(source, pageNumber, pageSize).ToList();
        count = page.Count;
        pageNumber++;
        foreach (var item in page)
            yield return item;
    } while (count > 0);
}
// And here goes our code for traversing the collection with paging and foreach
var result = from emps in dc.Employees
             where emps.age > 21
             select emps;

// Let's use a page size of 1000
foreach (var emp in Lazy(result, 1000)) {
    // Append this record in suitable format to the end of XML file
}
I think this way we can overcome the memory issue, while keeping the foreach syntax uncomplicated.