Node.js: does JSON.parse block the event loop?

Using JSON.parse is the most common way to parse a JSON string into a JavaScript object.
It is synchronous code, but does it actually block the event loop (since it's much lower-level than the user's code)?
Is there an easy way to parse JSON asynchronously? And should it matter at all for a few KB to a few hundred KB of JSON data?

A function that does not accept a callback or return a promise blocks until it returns a value.
So yes, JSON.parse blocks. Parsing JSON is a CPU-intensive task, and JS is single-threaded, so the parsing has to block the main thread at some point. Async only makes sense when waiting on another process or system (which is why disk I/O and networking make good async candidates: they have far more latency than raw CPU processing).
I'd first prove that parsing JSON is actually a bottleneck for your app before you start optimizing its parsing. I suspect it's not.

If you think that you might have a lot of heavy JSON decoding to do, consider moving it out to another process. I know it might seem obvious, but the key to using node.js successfully is in the name.
To set up another 'node' to handle a CPU-heavy task, use IPC. Simple sockets will do, but ØMQ adds a touch of radioactive magic in that it supports a variety of transports.
It might be that the overhead of connecting a socket and sending the JSON is more intensive overall, but it will certainly stop the blocking.
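Because the decoding lives in a separate process, that worker does not even have to be another Node instance. As a rough sketch of the pattern (not the original answer's code), here is a ØMQ REP worker written with pyzmq; the port, message shape, and reply are all illustrative:

    # Sketch of a worker process that does the CPU-heavy JSON parsing.
    # Assumes pyzmq is installed; the main app would connect a REQ socket
    # (from Node, any ZeroMQ binding works the same way).
    import json
    import zmq

    def main():
        ctx = zmq.Context()
        sock = ctx.socket(zmq.REP)
        sock.bind("tcp://127.0.0.1:5555")   # illustrative address
        while True:
            raw = sock.recv()               # raw JSON bytes from the main process
            data = json.loads(raw)          # the expensive parse happens off the main event loop
            # ... do whatever summarising or validation is needed here ...
            sock.send_string(json.dumps({"keys": len(data)}))

    if __name__ == "__main__":
        main()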

Related

How to apply backpressure to Tcl output channel?

We have an application that allows a user to pass an arbitrary Tcl code block (as a callback) to a custom API that invokes it on individual elements of a large data tree. For performance, this is done using a thread pool, so things can get ripping.
The problem is, we have no control over user code, and in one case they are doing a puts that causes memory to explode and the app to crash. I can prevent this by redirecting stdout to /dev/null, which leads me to believe that Tcl's internal buffers can't be emptied fast enough, so it keeps buffering. Heap analysis seems to confirm this.
What I don't understand is that I haven't messed with any of stdout's options, so it should be line buffered, blocking, 4k. So, my first question would be: why is this happening? Shouldn't there already be backpressure applied to prevent this?
My second question would be: how do I prevent this? If the user wants to do something stupid, I'm more than willing to throttle their performance, but I don't want the app to crash. I suppose one solution would be to redefine puts to write to a file (or simply do nothing) before the callback is invoked, but I'd be interested to know if there is a way to ensure backpressure on the channel to prevent it from continuing to buffer.
Thanks for any thoughts!
It depends on the channel type and how you've configured it. However, the normal model is that writes to a synchronous channel (-blocking true) will either buffer or write immediately (according to the -buffering option), and writes to an asynchronous channel (-blocking false) will, if not processed immediately, be queued to be carried out later by an internal event handler. For most applications, that does the right thing; it sounds like you've passed an asynchronous channel to code that doesn't call into the event loop (or at least not frequently).
Try chan configure-ing the channel to be synchronous before starting the user code; you're in a separate thread, so the blocking behaviour shouldn't be a problem for the rest of the application.
Some channels are more tricky. The one that people most normally encounter is the console channel in Tk on platforms such as Windows, where the channel ends up writing into a widget that doesn't have a maximum number of retained lines.

Why should I not use collect() in my Python Transforms?

TL;DR: I hear rumors that certain PySpark functions aren't advisable in Transforms, but I'm not sure which functions they are or why.
Why can't I just collect() my data in certain circumstances to a list and iterate over the rows?
There are a lot of pieces here one needs to understand to arrive at the final conclusion, namely that collect() and other such functions are inefficient uses of Spark.
Local vs. Distributed
First, let's cover the difference between local and distributed computation. In Spark, the pyspark.sql.functions and pyspark.sql.DataFrame operations you typically execute, such as join() or groupBy(), delegate execution to the underlying Spark libraries for maximum possible performance. Think of this as using Python simply as a more convenient language on top of SQL, where you are lazily describing the operations you want Spark to go do for you.
When you stick to SQL operations in PySpark, you can therefore expect highly scalable performance, but only for things you can express in SQL. This is where people typically take a lazy approach and implement their transformations using for loops instead of thinking about the best possible tactics.
Let's consider the case where you want to simply add a single value to an integer column in your DataFrame. On Stack Overflow and elsewhere you'll find plenty of examples (usually for more subtle cases) that suggest using collect() to bring the data into a Python list, looping over every row, and pushing the data back into a DataFrame when finished, and that is one tactic you could use here. Let's think about what it means in practice, however: you are bringing the data that is hosted in Spark back to the driver of your build, looping over each row in a single Python thread, and adding the constant value one row at a time. If we instead used the (obvious, in this case) SQL equivalent of this operation, Spark could take your data and add the value to individual rows in a massively parallel fashion. Namely, if you have 64 executors (instances of workers available to do the work of your job), then you have 64 'cores' (not a perfect analogy, but close) across which the data can be split so that each adds the value to its own rows. This lets you arrive at the result you wanted dramatically faster.
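For concreteness, here is a sketch of both approaches, assuming a DataFrame df with an integer column named count (the column name is illustrative):

    from pyspark.sql import functions as F

    # Anti-pattern: materialise every row on the driver and loop in single-threaded Python.
    rows = df.collect()
    bumped = [row["count"] + 1 for row in rows]   # one row at a time, on the driver

    # SQL-style equivalent: Spark adds the value in parallel across the executors.
    df_bumped = df.withColumn("count", F.col("count") + F.lit(1))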
Doing work on the driver is what I refer to as 'local' computation, and work done in the executors as 'distributed' computation.
This may be an obvious example, but it is often tough to remember this difference when dealing with more difficult transformations such as advanced window operations or linear algebra computations. Spark has libraries available to do matrix multiplications and manipulations in a distributed fashion, as well as some pretty advanced window operations that require a bit more thinking about your problem first.
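As an example of a windowed operation that stays distributed, here is a small sketch; the column names (user_id, ts, amount) are illustrative, not taken from the text above:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Running total per user, computed on the executors rather than the driver.
    w = Window.partitionBy("user_id").orderBy("ts")
    df_running = df.withColumn("running_total", F.sum("amount").over(w))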
Lazy evaluation
The most effective way to use PySpark is to dispatch your 'instructions' on how to build your DataFrame all at once, so that Spark can figure out the best way to materialize this data. In this way, functions that force the computation of a DataFrame so you can inspect it at some point in your code should be avoided if at all possible; they mean Spark is working extra to satisfy your print() statement or other method call instead of working towards writing out your data.
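A minimal sketch of what this looks like in practice; the column names and output path are illustrative:

    from pyspark.sql import functions as F

    # Describe the whole pipeline up front; nothing executes yet.
    filtered = df.filter(F.col("status") == "active")
    enriched = filtered.withColumn("count", F.col("count") + F.lit(1))
    result = enriched.select("user_id", "count")

    # Avoid sprinkling result.show() or result.count() between these steps just to peek
    # at the data: each one forces Spark to run a job before the real output is produced.
    result.write.format("parquet").save("/tmp/output")   # single action at the end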
Python in Java in Scala
The Python code you write does not run inside Spark itself: it drives a JVM process (via the Py4J bridge), and the JVM in turn talks to the Spark runtime, which is written in Scala. So, for every call to collect() where you wish to materialize your data in Python, Spark must materialize your data into a single, locally available DataFrame, translate it from Scala into its Java equivalent, and finally pass it from the JVM to its Python equivalents before it is available to iterate over. This is an incredibly inefficient process, and one that isn't possible to parallelize.
As a result, operations that render your data back to Python are best avoided.
Functions to avoid
So, what functions should you avoid?
collect
head
take
first
show
Each of these methods will force execution on the DataFrame and bring the results back to the Python runtime for display / use. This means Spark won't have the opportunity to lazily figure out the most efficient way to compute upon your data and will instead be forced to bring back the data requested before proceeding with any other execution.
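To make the cost concrete, each of the calls below triggers an immediate Spark job and ships rows back to the driver (the argument values are illustrative):

    df.collect()    # all rows back to the Python driver
    df.head(10)     # first 10 rows back to the driver
    df.take(10)     # same as head(10)
    df.first()      # a single row back to the driver
    df.show(20)     # formats 20 rows and prints them on the driver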

Guidance as to when querying the database for read operations should be done using MassTransit request/response for time-bound operations

For Create operations it is clear that putting the message on a queue is a good idea, in case the processing or creation of that entity takes longer than expected, and for the other benefits queues bring.
However, for read operations that are time-bound (must return to the UI in less than 3 seconds) it is not entirely clear whether a queue is a good idea.
http://masstransit-project.com/MassTransit/usage/request-response.html provides a nice abstraction but it goes through the queue.
Can someone provide some suggestions as to why I would or would not use MassTransit (or, for that matter, any technology like NServiceBus) for database read operations that are UI time-bound?
Should I use MassTransit only for long-running processes?
Request/Reply is a perfectly valid pattern for time-bound operations. Transport costs, in the case of RabbitMQ for example, are very low. I measured the performance of request/response using ServiceStack (which is very fast) and MassTransit. There is an initial delay with MassTransit while it caches the endpoints, but apart from that the speed is pretty much the same.
Benefits here are:
Retries
Fine tuning of timeouts
Easy scaling with competing consumers
just to name the most obvious ones.
And with error handling you get your failed requests ending up in the error queue, so there is no data loss and you can always look there to find out what went wrong and why.
Update: there is a SOA pattern that describes this (or a rather similar) approach. It is called Decoupled Invocation.

Using eventlet to process large number of https requests with large data returns

I am trying to use eventlet to process a large number of data requests, approx. 100,000 requests at a time to a remote server, each of which should generate a 10k-15k byte JSON response. I have to decode the JSON, then perform some data transformations (some field name changes, some simple transforms like English->metric, but a few require minor parsing), and send all 100,000 requests out the back end as XML in a couple of formats expected by a legacy system.
I'm using the code from the eventlet example which uses imap() ("for body in pool.imap(fetch, urls): ..."), lightly modified. eventlet is working well so far on a small sample (5K urls) to fetch the JSON data.
My question is whether I should add the non-I/O processing (JSON decode, field transform, XML encode) to the fetch() function so that all that transform processing happens in the greenthread, or whether I should do the bare minimum in the greenthread, return the raw response body, and do the main processing in the "for body in pool.imap():" loop. I'm concerned that if I do the latter, the amount of data from completed threads will start building up and will bloat memory, whereas doing the former would essentially throttle the process so that the XML output keeps up.
Suggestions as to the preferred way to implement this are welcome. Oh, and this will eventually run off of cron hourly, so it really has a time window it has to fit into. Thanks!
Ideally, you put each data processing operation into a separate green thread. Then, only when required, combine several operations into a batch or use a pool to throttle concurrency.
When you do the non-I/O-bound processing in one loop, you essentially throttle concurrency to one simultaneous task. But you can run those tasks in parallel using the OS thread pool in the eventlet.tpool module.
Throttle concurrency only when you have too much parallel CPU-bound code running.
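A rough sketch of that shape, with the CPU-heavy step pushed through eventlet.tpool so the hub can keep servicing other greenthreads; fetch_and_process, the URLs, and the transform body are illustrative stand-ins for the real code:

    import eventlet
    eventlet.monkey_patch()           # make socket I/O cooperative

    import json
    import urllib.request
    from eventlet import tpool

    def transform(body):
        data = json.loads(body)       # CPU-bound: decode, rename fields, convert units, build XML
        return "<item>%d</item>" % len(data)

    def fetch_and_process(url):
        with urllib.request.urlopen(url) as resp:   # I/O: cheap to run in a greenthread
            body = resp.read()
        return tpool.execute(transform, body)       # CPU work goes to an OS thread

    pool = eventlet.GreenPool(200)
    urls = ["https://example.com/item/%d" % i for i in range(100)]
    for xml in pool.imap(fetch_and_process, urls):
        pass  # hand xml off to the legacy system here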

Speed of decoding and encoding JSON in Perl

I'm working on a small Perl script, and I store the data using JSON.
I decode the JSON string using from_json and encode with to_json.
To be more specific:
The data scale could be something like 100,000 items in a hash.
The data is stored in a file on disk.
So to decode it, I'll have to read it from the disk first.
And my question is:
There is a huge difference in speed between the decoding and the encoding process.
The encoding process seems to be much faster than the decoding process.
I wonder what makes that difference?
Parsing is much more computationally expensive than formatting.
from_json has to parse the JSON text and convert it into Perl data structures; to_json merely has to iterate through the data structure and "print" out each item in a formatted way.
Parsing is a complex topic that is still the focus of CS theory work. At the base level, however, parsing is a two-step operation: you need to scan the input stream for tokens and then validate the sequence of tokens as a valid statement in the language. Encoding, on the other hand, is a single-step operation: you already know the data is valid, you simply have to convert it to the output representation.
JSON (the module) is not a parser/encoder. It's merely a front-end for JSON::XS (very fast) or JSON::PP (not so much). JSON will use JSON::XS if it's installed, but defaults to JSON::PP if it's not. You might see very different numbers depending on whether you have JSON::XS installed or not.
I could see a pure-Perl parser (like JSON::PP) having very different performance for encoding and decoding, because it's hard to write something optimal given all the Perl-level overhead, but the difference should be much smaller with JSON::XS.
It might still be a bit slower to decode than to encode using JSON::XS because of all the memory blocks it has to allocate. Allocating memory is a relatively expensive process, and it needs to be done far more often when decoding than when encoding. For example, a Perl string consists of three memory blocks (the scalar head, the scalar body and the string buffer itself). When encoding, memory is only allocated when the output buffer needs to be enlarged.
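The asymmetry is easy to measure for yourself. A sketch of such a benchmark follows, written with Python's json module purely to show the shape of the test; the direct Perl equivalent would use the core Benchmark module with JSON::XS's decode_json/encode_json over the same 100,000-item hash:

    import json
    import timeit

    # Build a structure roughly the size described in the question.
    data = {"item%d" % i: {"name": "x" * 20, "value": i} for i in range(100000)}
    text = json.dumps(data)

    decode_t = timeit.timeit(lambda: json.loads(text), number=10)   # parse: tokenise + build structures
    encode_t = timeit.timeit(lambda: json.dumps(data), number=10)   # format: walk structures + print
    print("decode: %.2fs  encode: %.2fs" % (decode_t, encode_t))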