Perl, Children, and Shared Data - mysql

I'm working with a database that holds tens of thousands of URLs. I'm attempting to multi-thread a resolver that simply tries to resolve a given domain. On success, it compares the result to what's currently in the database; if it's different, the result is updated. If resolution fails, that is recorded as well.
Naturally, this will produce an inordinate volume of database calls. Since I'm still fairly new to Perl and confused about the best way to achieve some form of asynchronous load distribution, I have the following questions.
What is the best option for distributing the workload? Why?
How should I gather the URLs to resolve prior to spawning?
Creating a hash of domains with the data to be compared seems to make the most sense to me. Then split it up, fire up the children, and have the children return the changes to be made to the parent.
How should returning data to the parent be handled in a clean manner?
I've been playing with a more Pythonic method (given that I have more experience in Python), but have yet to make it work due to a lack of blocking for some reason. Aside from that issue, threading isn't the best option simply due to (a lack of) CPU time for each thread (plus, I've been crucified more than once in the Perl channel for using threads :P and for good reason).
Below is more-or-less pseudo-code that I've been playing with for my threads (it should be read as a supplement to my explanation of what I'm trying to accomplish, rather than anything else).
# Create children...
for (my $i = 0; $i < $threads_to_spawn; $i++)
{
    threads->create(\&worker);
}
The parent then sits in a loop, monitoring a shared array of domains. It locks and re-populates it if it becomes empty.

Your code is the start of a persistent worker model.
use threads;
use Thread::Queue 1.03 qw( );
use constant NUM_WORKERS => 5;

sub work {
    my ($dbh, $job) = @_;
    ...
}

{
    my $q = Thread::Queue->new();

    # Each worker gets its own database handle and pulls jobs
    # from the queue until the queue is ended.
    for (1..NUM_WORKERS) {
        async {
            my $dbh = ...;
            while (my $job = $q->dequeue()) {
                work($dbh, $job);
            }
        };
    }

    for my $job (...) {
        $q->enqueue($job);
    }

    # Signal the workers that no more jobs are coming, then wait.
    $q->end();
    $_->join() for threads->list();
}
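To answer the question about returning data to the parent cleanly: one option (my own suggestion, not part of the code above) is a second Thread::Queue that the workers push results into and the parent drains after joining:

my $results = Thread::Queue->new();

# Inside a worker, after resolving a domain
# ($domain and $resolved_ip are hypothetical names):
$results->enqueue([ $domain, $resolved_ip ]);

# In the parent, after all workers have been joined:
$results->end();
while ( defined( my $r = $results->dequeue_nb() ) ) {
    my ($domain, $ip) = @$r;
    ...   # apply the change to the database
}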
Performance tips:
Tweak the number of workers for your system and workload.
Grouping small jobs into larger batches can improve speed by reducing per-job overhead (see the sketch below).
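For example, the queue can carry array references instead of single items. A minimal sketch, assuming the $q from the code above and a hypothetical @urls list:

# Enqueue batches of 100 URLs so each worker pays the queue
# overhead once per batch instead of once per URL.
my @batch;
for my $url (@urls) {
    push @batch, $url;
    if (@batch >= 100) {
        $q->enqueue([ @batch ]);
        @batch = ();
    }
}
$q->enqueue([ @batch ]) if @batch;   # flush the final partial batch

A worker then dequeues one array reference and loops over its contents.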


GCP dataflow - processing JSON takes too long

I am trying to process JSON files in a bucket and write the results into a bucket:
DataflowPipelineOptions options = PipelineOptionsFactory.create()
    .as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);
options.setProject("the-project");
options.setStagingLocation("gs://some-bucket/temp/");

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
 .apply(ParDo.named("SanitizeJson").of(new DoFn<String, String>() {
     @Override
     public void processElement(ProcessContext c) {
         try {
             JsonFactory factory = JacksonFactory.getDefaultInstance();
             String json = c.element();
             SomeClass e = factory.fromString(json, SomeClass.class);
             // manipulate the object a bit...
             c.output(factory.toString(e));
         } catch (Exception err) {
             LOG.error("Failed to process element: " + c.element(), err);
         }
     }
 }))
 .apply(TextIO.Write.to("gs://some-bucket/output/"));
p.run();
I have around 50,000 files under the path gs://some-bucket/2016/04/28/ (in sub-directories).
My question is: does it make sense that this takes more than an hour to complete? Doing something similar on a Spark cluster on Amazon takes about 15-20 minutes. I suspect that I might be doing something inefficiently.
EDIT:
In my Spark job I aggregate all the results in a DataFrame and only then write the output, all at once. I noticed that my pipeline here writes each file separately; I assume that is why it's taking much longer. Is there a way to change this behavior?
Your jobs are hitting a couple of performance issues in Dataflow, caused by the fact that it is optimized for executing work in larger increments, while your job is processing lots of very small files. As a result, some aspects of the job's execution end up dominated by per-file overhead. Here are some details and suggestions.
The job is limited by writing output rather than by reading input (though reading input is also a significant part of the cost). You can significantly cut that overhead by specifying withNumShards on your TextIO.Write, depending on how many files you want in the output; e.g. 100 could be a reasonable value. By default you get an unspecified number of files, which in this case, given the current behavior of the Dataflow optimizer, matches the number of input files. Usually that is a good thing, because it allows Dataflow not to materialize the intermediate data, but here it's a bad idea because the input files are so small and the per-file overhead dominates.
I recommend setting maxNumWorkers to a value like e.g. 12 - currently the second job is autoscaling to an excessively large number of workers. Dataflow's autoscaling is currently geared toward jobs that process data in larger increments; it doesn't take per-file overhead into account and so behaves poorly in your case.
The second job is also hitting a bug because of which it fails to finalize the written output. We're investigating; however, setting maxNumWorkers should also make it complete successfully.
In short:
set maxNumWorkers=12
set TextIO.Write.to("...").withNumShards(100)
and it should run much better.
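Applied to the pipeline from the question, the two changes would look roughly like this (a sketch; sanitizeFn is a hypothetical variable holding the same DoFn as before):

options.setMaxNumWorkers(12);   // cap autoscaling

p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
 .apply(ParDo.named("SanitizeJson").of(sanitizeFn))
 .apply(TextIO.Write.to("gs://some-bucket/output/")
                    .withNumShards(100));   // ~100 output files instead of one per input file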

Nesting Asynchronous Promises in ActionScript

I have a situation where I need to perform dependent asynchronous operations. For example: check the database for data; if there is data, perform a database write (insert/update); if not, continue without doing anything. I have written myself a promise-based database API using promise-as3. Any database operation returns a promise that is resolved with the data of a read query, or with the Result object(s) of a write query. I do the following to nest promises and create one point of resolution or rejection for the entire 'initialize' operation.
public function initializeTable():Promise
{
    var dfd:Deferred = new Deferred();
    select("SELECT * FROM table").then(tableDefaults).then(resolveDeferred(dfd)).otherwise(errorHandler(dfd));
    return dfd.promise;
}

public function tableDefaults(data:Array):Promise
{
    if(!data || !data.length)
    {
        //defaultParams is an Object of table default fields/values.
        return insert("table", defaultParams);
    } else
    {
        var resolved:Deferred = new Deferred();
        resolved.resolve(null);
        return resolved.promise;
    }
}

public function resolveDeferred(deferred:Deferred):Function
{
    return function resolver(value:*=null):void
    {
        deferred.resolve(value);
    }
}

public function rejectDeferred(deferred:Deferred):Function
{
    return function rejector(reason:*=null):void
    {
        deferred.reject(reason);
    }
}
My main questions:
Are there any performance issues that will arise from this? Memory leaks etc.? I've read that function variables perform poorly, but I don't see another way to nest operations so logically.
Would it be better to have, say, a global resolved instance that is created and resolved only once, but returned whenever we need an 'empty' promise?
EDIT:
I'm removing question 3 ("Is there a better way to do this?"), as it seems to be leading to opinions on the nature of promises in asynchronous programming. I meant better within the scope of promises, not asynchronicity in general. Assume you have to use this promise-based API for the sake of the question.
I usually don't write these kinds of opinion-based answers, but here it's pretty important. Promises in AS3 = THE ROOT OF ALL EVIL :) And I'll explain why.
First, as BotMaster said - it's weakly typed. What this means is that you aren't using AS3 properly, and the only reason this is even possible is backwards compatibility. The truth is that Adobe has spent enormous effort turning AS3 into a strongly typed OOP language. Don't stray away from that.
The second point is that Promises were created in the first place so that poor developers could actually get some work done in JavaScript. This is not a new design pattern or anything; it actually has no real benefits if you know how to structure your code properly. The thing Promises help with most is avoiding the so-called Wall of Hell, but there are other ways to fix this in a natural manner (the most basic being not to write functions within functions, but on the same level, and simply check the passed result).
The most important thing here is the nature of Promises. Very few people know what they actually do behind the scenes. Because of the nature of JavaScript (and ECMAScript in general), there is no real way to tell whether a function completed properly or not. If you return false / null / undefined - these are all regular return values. The only way to actually say "this operation failed" is by throwing an error. So every promisified method can potentially throw an error, and each error must be handled, or otherwise your code can stop working properly. What this means is that every single action inside a Promise is wrapped in a try-catch block! Every time you do absolutely basic stuff, you wrap it in try-catch. Even this block of yours:
else
{
    var resolved:Deferred = new Deferred();
    resolved.resolve(null);
    return resolved.promise;
}
In a "regular" way, you would simply use else { return null }. But now, you create tons of objects, resolvers, rejectors, and finally - you try-catch this block.
I cannot stress this enough, but I think you are getting the point. Try-catch is extremely slow! I understand that this is not a big problem in such a simple case as the one I just mentioned, but imagine you are doing it more often and in heavier methods. You are just doing extremely slow operations, and for what? So you can write lame code and just enjoy it.
The last thing to say - there are plenty of ways to run asynchronous operations one after another. Just by googling as3 function queue I found a few. Not to mention that the event-based system is very flexible, and there are even alternatives to it (using callbacks). You've got it all in your hands, and you are turning to something that was created to compensate for the lack of proper alternatives elsewhere.
So my sincere advice, as a person who has worked with Flash for a decade doing casino games in big teams, would be: don't ever try using promises in AS3. Good luck!
var dfd:Deferred = new Deferred();
select("SELECT * FROM table").then(tableDefaults).then(resolveDeferred(dfd)).otherwise(errorHandler(dfd));
return dfd.promise;
This is The Forgotten Promise antipattern. It can instead be written as:
return select("SELECT * FROM table").then(tableDefaults);
This removes the need for the resolveDeferred and rejectDeferred functions.
var resolved:Deferred = new Deferred();
resolved.resolve(null);
return resolved.promise;
I would either extract this to another function, or use Promise.when(null). A global instance wouldn't work, because it would mean that the result handlers from one call could be called for a different one.
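Extracted to a helper, it could look like this (a sketch using only the Deferred and Promise types already in the question; the name resolvedPromise is mine):

public function resolvedPromise(value:* = null):Promise
{
    // A fresh Deferred per call, resolved immediately.
    var resolved:Deferred = new Deferred();
    resolved.resolve(value);
    return resolved.promise;
}

Because each call returns a fresh promise, handlers attached by one caller can never fire for another.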

A tool to detect unnecessary recursive calls in a program?

A very common beginner mistake when writing recursive functions is to accidentally fire off completely redundant recursive calls from the same function. For example, consider this recursive function that finds the maximum value in a binary tree (not a binary search tree):
int BinaryTreeMax(Tree* root) {
    if (root == NULL) return INT_MIN;

    int maxValue = root->value;
    if (maxValue < BinaryTreeMax(root->left))
        maxValue = BinaryTreeMax(root->left);   // (1)
    if (maxValue < BinaryTreeMax(root->right))
        maxValue = BinaryTreeMax(root->right);  // (2)

    return maxValue;
}
Notice that this program potentially makes two completely redundant recursive calls to BinaryTreeMax in lines (1) and (2). We could rewrite this code so that there's no need for these extra calls by simply caching the values beforehand:
int BinaryTreeMax(Tree* root) {
    if (root == NULL) return INT_MIN;

    int maxValue = root->value;
    int leftValue = BinaryTreeMax(root->left);
    int rightValue = BinaryTreeMax(root->right);

    if (maxValue < leftValue)
        maxValue = leftValue;
    if (maxValue < rightValue)
        maxValue = rightValue;

    return maxValue;
}
Now, we always make exactly two recursive calls.
My question is whether there is a tool that does either a static or dynamic analysis of a program (in whatever language you'd like; I'm not too picky!) that can detect whether a program is making completely unnecessary recursive calls. By "completely unnecessary" I mean that
The recursive call has been made before,
by the same invocation of the recursive function (or one of its descendants), and
the call itself has no observable side-effects.
This is something that can usually be determined by hand, but I think it would be great if there were a tool that could flag things like this automatically, as a way of helping students learn to avoid simple but expensive mistakes that can contribute to huge inefficiencies in their programs.
Does anyone know of such a tool?
First, your definition of "completely unnecessary" is insufficient: it is possible that some code between the two function calls affects the result of the second call.
Second, this has nothing to do with recursion; the same question applies to any function call: has it been called before with the exact same parameters, does it have no side effects, and has no code between the two calls changed any data the function accesses?
Now, I'm pretty sure a perfect solution is impossible, as it would amount to solving the halting problem, but that doesn't mean there isn't a way to detect enough of these cases and optimize away some of them.
Some compilers know how to do that (GCC has a specific flag that warns you when it does so). Here's a 2003 article I found about the issue: http://www.cs.cmu.edu/~jsstylos/15745/final.pdf.
I couldn't find a tool for this, though, but that's probably something Eric Lippert knows, if he happens to bump into your question.
Some compilers (such as GCC) do have ways to mark deterministic functions explicitly. To be more precise, __attribute__((const)) (see GCC function attributes) applies restrictions to the function body so that its result depends only on its arguments, with no dependency on the program's shared state or on other non-deterministic functions. The compiler can then eliminate duplicate calls to costly functions. Some other high-level language implementations (Haskell, perhaps) do such tests automatically.
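As a minimal sketch of how the attribute is used (the function names are mine, and whether the second call is actually folded depends on the compiler and optimization level):

/* A pure function of its argument; the const attribute promises GCC
 * that the result depends on nothing but `key`. */
__attribute__((const))
static int expensive_lookup(int key)
{
    int acc = key;
    for (int i = 0; i < 1000; i++)   /* stand-in for real work */
        acc = acc * 31 + i;
    return acc;
}

int twice(int key)
{
    /* With optimization enabled, GCC may evaluate expensive_lookup(key)
     * once and reuse the result for both operands. */
    return expensive_lookup(key) + expensive_lookup(key);
}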
Really, I don't know of tools for such analysis (but if I find one I will be happy). A tool that correctly detects unnecessary recursion - or, more generally, unnecessary function evaluation - in a language-agnostic environment would be a kind of determinacy prover.
BTW, it's not so difficult to write such a program when you already have access to the semantic tree of the code :)

Parallel deserialization of Json from a database

This is the scenario: in a separate task I read from a data reader that represents a single-column result set of JSON strings. In that task I add each JSON string to a BlockingCollection that wraps a ConcurrentQueue. At the same time, in the main thread, I TryTake/dequeue a JSON string from the collection and then yield return it deserialized.
The reading from the database and the deserialization run at approximately the same speed, so there will not be too much memory consumption caused by a large BlockingCollection.
When the reading from the database is done, the task is closed and I then deserialize all of the remaining JSON strings.
Questions/thoughts:
1) Does the TryTake lock so that no adding can be done?
2) Don't do it. Just do it in serial and yield return.
using (var q = new BlockingCollection<string>())
{
    Task task = null;
    try
    {
        task = new Task(() =>
        {
            foreach (var json in sourceData)
                q.Add(json);
        });
        task.Start();

        while (!task.IsCompleted)
        {
            string json;
            if (q.TryTake(out json))
                yield return Deserialize<T>(json);
        }

        Task.WaitAll(task);
    }
    finally
    {
        if (task != null)
        {
            task.Dispose();
        }
        q.CompleteAdding();
    }

    foreach (var e in q.GetConsumingEnumerable())
        yield return Deserialize<T>(e);
}
Question 1
Does the TryTake lock so that no adding can be done
There will be a very brief period during which an add cannot be performed, but this time will be negligible. From http://msdn.microsoft.com/en-us/library/dd997305.aspx:
Some of the concurrent collection types use lightweight synchronization mechanisms such as SpinLock, SpinWait, SemaphoreSlim, and CountdownEvent, which are new in the .NET Framework 4. These synchronization types typically use busy spinning for brief periods before they put the thread into a true Wait state. When wait times are expected to be very short, spinning is far less computationally expensive than waiting, which involves an expensive kernel transition. For collection classes that use spinning, this efficiency means that multiple threads can add and remove items at a very high rate. For more information about spinning vs. blocking, see SpinLock and SpinWait.

The ConcurrentQueue and ConcurrentStack classes do not use locks at all. Instead, they rely on Interlocked operations to achieve thread-safety.
Question 2:
Don't do it. Just do it in serial and yield return.
This seems like the way to go. As with any optimisation work - do what is simplest and then measure! If there is a bottleneck here, consider optimising, but at least you'll know whether your 'optimisations' are actually helping, by virtue of having metrics to compare against.
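For comparison, the serial version is tiny (a sketch reusing sourceData and Deserialize<T> from the question):

// One pass: read a row, deserialize it, hand it to the caller.
foreach (var json in sourceData)
    yield return Deserialize<T>(json);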

Successive success checks

Most of you have probably bumped into a situation where multiple things must check out, in a certain order, before the application can proceed - for example, in the very simple case of creating a listening socket (socket, bind, listen, accept, etc.). There are at least two obvious ways (don't take this 100% verbatim):
if (1st_ok)
{
    if (2nd_ok)
    {
        ...
or
if (!1st_ok)
{
    return;
}

if (!2nd_ok)
{
    return;
}

...
Have you ever thought of anything smarter? Do you prefer one of the above over the other? Or do you (if the language provides for it) use exceptions?
I prefer the second technique. The main problem with the first one is that it increases the nesting depth of the code, which is a significant issue when you've got a substantial number of preconditions or resource allocations to check, since the business part of the function ends up deeply buried behind a wall of conditions (and frequently loops, too). In the second case, you can simplify the conceptual logic to "we've got here and everything's OK", which is much easier to work with. Keeping the normal case as straight-line as possible is simply easier to grok, especially when doing maintenance coding.
It depends on the language - e.g. in C++ you might well use exceptions, while in C you might use one of several strategies:
if/else blocks
goto (one of the few cases where a single goto label for "exception" handling might be justified)
break within a do { ... } while (0) loop (sketched below)
Personally I don't like multiple return statements in a function - I prefer a common clean-up block at the end of the function followed by a single return statement.
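Here is a minimal sketch of the do { ... } while (0) approach combined with a single return (first_ok and second_ok are hypothetical checks):

int first_ok(void);    /* hypothetical check functions */
int second_ok(void);

int setup(void)
{
    int rc = -1;                 /* assume failure */

    do {
        if (!first_ok())
            break;               /* jump straight to the clean-up */
        if (!second_ok())
            break;
        rc = 0;                  /* everything succeeded */
    } while (0);

    /* common clean-up block runs on success and failure alike */
    return rc;
}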
This tends to be a matter of style. Some people only like returning at the end of a procedure, others prefer to do it wherever needed.
I'm a fan of the second method, as it allows for clean and concise code as well as ease of adding documentation on what it's doing.
// Checking for llama integration
if (!1st_ok)
{
return;
}
// Llama found, loading spitting capacity
if (!2nd_ok)
{
return;
}
// Etc.
I prefer the second version.
In the normal case, all code between the checks executes sequentially, so I like to see them at the same level. Normally none of the if branches are executed, so I want them to be as unobtrusive as possible.
I use the 2nd because I think it reads better and the logic is easier to follow. Also, they say exceptions should not be used for flow control, but for exceptional and unexpected cases. I'd like to see what the pros say about this.
What about
if (1st_ok && 2nd_ok) { }
or if some work must be done, like in your example with sockets
if (1st_ok() && 2nd_ok()) { }
I avoid the first solution because of nesting.
I avoid the second solution because of corporate coding rules which forbid multiple return in a function body.
Of course coding rules also forbid goto.
My workaround is to use a local variable:
bool isFailed = false; // or whatever is available for bool/true/false

if (!check1) {
    log_error();
    try_recovery_action();
    isFailed = true;
}

if (!isFailed) {
    if (!check2) {
        log_error();
        try_recovery_action();
        isFailed = true;
    }
}
...
This is not as beautiful as I would like, but it is the best I've found to conform to my constraints and write readable code.
For what it is worth, here are some of my thoughts and experiences on this question.
Personally, I tend to prefer the second case you outlined. I find it easier to follow (and debug) the code. That is, as the code progresses, it becomes "more correct". In my own experience, this has seemed to be the preferred method.
I don't know how common it is in the field, but I've also seen condition testing written as ...
error = foo1();
if ((error == OK) && (test1)) {
    error = foo2();
}
if ((error == OK) && (test2)) {
    error = foo3();
}
...
return (error);
Although readable (always a plus in my books) and avoiding deep nesting, it always struck me as using a lot of unnecessary testing to achieve those ends.
The first method I see used less frequently than the second. In the vast majority of those cases, it was because there was no nice way around it. In the remaining few instances, it was justified on the basis of extracting a little more performance in the success case: the argument was that the processor would predict a forward branch as not taken (corresponding to the else clause). Whether that holds depends on several factors, including the architecture, compiler, language, and need... Obviously most projects (and most aspects of a project) did not meet those requirements.
Hope this helps.