Why should I not use collect() in my Python Transforms? - palantir-foundry

TL;DR: I've heard that certain PySpark functions aren't advisable in Transforms, but I'm not sure which functions those are or why they should be avoided.
Why can't I just collect() my data in certain circumstances to a list and iterate over the rows?

There are several pieces one needs to understand to arrive at the final conclusion, namely that collect() and similar functions are inefficient uses of Spark.
Local vs. Distributed
First, let's cover the difference between local and distributed computation. In Spark, the pyspark.sql.functions and pyspark.sql.DataFrame operations you typically execute, such as join() or groupBy(), delegate their execution to the underlying Spark libraries for maximum possible performance. Think of this as using Python simply as a more convenient language on top of SQL, in which you lazily describe the operations you want Spark to go do for you.
In this way, when you stick to SQL operations in PySpark, you can expect highly scalable performance, but only for things you can express in SQL. This is where people can typically take a lazy approach and implement their transformations using for loops instead of thinking about the best possible tactics.
Let's consider the case where you simply want to add a single value to an integer column in your DataFrame. On Stack Overflow and elsewhere you'll find plenty of examples, usually for more subtle cases, that suggest using collect() to bring the data into a Python list, looping over every row, and pushing the data back into a DataFrame when finished. That is one tactic you could use here. Think about what it means in practice, however: you are bringing your data, which is hosted in Spark, back to the driver of your build, looping over it with a single Python thread, and adding a constant value to each row one at a time. If we instead used the (in this case obvious) SQL equivalent of this operation, Spark could take your data and add the value to the individual rows massively in parallel. Namely, if you have 64 executors (instances of workers available to do the work of your job), then you'll have 64 'cores' (this isn't a perfect analogy, but it is close) across which the data is split, each adding the value to its share of the column. This gets you to the end result dramatically faster.
Doing work on the driver is what I refer to as 'local' computation, and work in executors as 'parallel'.
This may be an obvious example, but it is often tough to remember this difference when dealing with more difficult transformations such as advanced windowing operations or linear algebra computations. Spark has libraries available to do matrix multiplications and manipulations in a distributed fashion, as well as some fairly advanced window operations that require a bit more thinking about your problem first.
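The add-a-constant example can be imitated in plain Python to show the shape of the two approaches. This is only an analogy, not Spark code: the function names are made up, threads stand in for executors, and in CPython the threaded version is not actually faster. The point is that each partition can be handled without looking at any other, which is what lets Spark hand the pieces to separate executors.

```python
from concurrent.futures import ThreadPoolExecutor


def add_constant_local(rows, value):
    """'Local' style: one loop walks every row, as after a collect()."""
    return [row + value for row in rows]


def split_into_partitions(rows, num_partitions):
    """Cut the rows into at most num_partitions contiguous chunks."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def add_constant_partitioned(rows, value, num_partitions=4):
    """'Distributed' style: every partition is processed independently."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # map() returns the per-partition results in partition order
        results = pool.map(lambda part: [row + value for row in part], partitions)
        return [row for part in results for row in part]


rows = list(range(10))
local_result = add_constant_local(rows, 5)
partitioned_result = add_constant_partitioned(rows, 5)
```

Both calls produce the same list; only the second one is written so that the work could be spread out.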
Lazy evaluation
The most effective way to use PySpark is to dispatch your 'instructions' on how to build your DataFrame all at once, so that Spark can figure out the best way to materialize this data. In this way, functions that force the computation of a DataFrame so you can inspect it at some point in your code should be avoided if at all possible; they mean Spark is working extra to satisfy your print() statement or other method call instead of working towards writing out your data.
Python in Java in Scala
The Python runtime does not execute Spark operations itself; it talks to a JVM that hosts the Spark runtime, which is written in Scala. So, for every call to collect() where you wish to materialize your data in Python, Spark must gather your data into a single, locally available result on the driver, serialize it out of the JVM, and finally convert it into the equivalent Python objects before it is available to iterate over. This is an incredibly inefficient process that can't be parallelized.
For this reason, operations that render your data to Python are best avoided.
Functions to avoid
So, what functions should you avoid?
collect
head
take
first
show
Each of these methods will force execution on the DataFrame and bring the results back to the Python runtime for display / use. This means Spark won't have the opportunity to lazily figure out the most efficient way to compute upon your data and will instead be forced to bring back the data requested before proceeding with any other execution.

Related

What is the use of task graphs in CUDA 10?

CUDA 10 added runtime API calls for putting streams (= queues) in "capture mode", so that instead of executing, they are returned in a "graph". These graphs can then be made to actually execute, or they can be cloned.
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but in order for the change of parameters to take effect, I had to rebuild the graph executable with 'cudaGraphInstantiate'. This call takes so long that any gain from using graphs is lost (in my case). Setting the parameters only worked for me when I built the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think that calling 'cudaGraphGetNodes' on a stream-captured graph would return the nodes, but the node pointer returned was NULL for me even though the 'numNodes' variable had the correct number. The documentation explicitly mentions this as a possibility but fails to explain why.
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template, so that instead of enqueueing the individual nodes before every execution, one changes the parameters of every node before every execution (and perhaps not all nodes actually need their parameters changed).
For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
Another useful feature is concurrent kernel execution. In manual mode, one can add nodes to the graph with dependencies. It will then exploit the concurrency automatically using multiple streams. The feature itself is not new, but making it automatic is useful for certain applications.
When training a deep learning model, one often re-runs the same set of kernels in the same order but with updated data. Also, I would expect CUDA to perform optimizations by knowing statically which kernels come next. We can imagine that CUDA could fetch more instructions or adapt its scheduling strategy when it knows the whole graph.
CUDA Graphs is trying to solve the problem that in the presence of too many small kernel invocations, you see quite some time spent on the CPU dispatching work for the GPU (overhead).
It allows you to trade resources (time, memory, etc.) to construct a graph of kernels that you can launch with a single invocation from the CPU instead of doing multiple invocations. If you don't have enough invocations, or your algorithm is different each time, then it won't be worth it to build a graph.
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
You are not going to see great results if you have an algorithm that you invoke once or if your kernels are big; in that case the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it is given in Getting Started with CUDA Graphs.
Where task-graph-based paradigms shine, though, is when you define your program as tasks with dependencies between them. You give the driver / scheduler / hardware a lot of flexibility to do the scheduling itself, without much fine-tuning on the developer's part. There's a reason why we have been spending years exploring the ideas of dataflow programming in HPC.
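The trade-off in this answer (pay once to build the graph, then save CPU dispatch time on every iteration) can be written as back-of-the-envelope arithmetic. The sketch below is a toy cost model in Python, not anything from the CUDA API. The function name, its parameters and the numbers are all hypothetical, and it only counts CPU-side launch cost, ignoring GPU execution time entirely.

```python
def break_even_iterations(kernels_per_iteration, launch_cost_us,
                          graph_launch_cost_us, graph_build_cost_us):
    """Smallest iteration count at which the graph has the lower total CPU cost.

    Returns None when one graph launch is no cheaper than the separate launches.
    """
    separate_cost_us = kernels_per_iteration * launch_cost_us
    saving_per_iteration_us = separate_cost_us - graph_launch_cost_us
    if saving_per_iteration_us <= 0:
        return None
    # Need n * saving > build cost, so n is the next integer above the ratio.
    return graph_build_cost_us // saving_per_iteration_us + 1


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
iterations_needed = break_even_iterations(
    kernels_per_iteration=20, launch_cost_us=5,
    graph_launch_cost_us=10, graph_build_cost_us=900)
```

With these made-up inputs the graph saves 90 µs per iteration against a 900 µs build, so it only comes out ahead from the 11th iteration on. An algorithm that runs once, or one with a single large kernel per iteration, never reaches break-even in this model.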

When to use tensorflow datasets api versus pandas or numpy

There are a number of guides I've seen on using LSTMs for time series in tensorflow, but I am still unsure about the current best practices in terms of reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.
In my situation I have a file data.csv with my features, and would like to do the following two tasks:
Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,
labels[i] = features[i + h, -1] / features[i, -1] - 1
I would like h to be a parameter here, so I can experiment with different horizons.
Get rolling windows - for training purposes, I need to roll my features into windows of length window:
train_features[i] = features[i: i + window]
I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
Edit: I guess that I'd also like to know whether the 2 tasks I listed are suited for the Dataset API, or if I'm better off using other libraries to deal with them?
First off, note that you can use the Dataset API with pandas or numpy arrays, as described in the tutorial:
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
A more interesting question is whether you should organize the data pipeline with session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":
While feeding data using a feed_dict offers a high level of flexibility, in most instances using feed_dict does not scale optimally. However, in instances where only a single GPU is being used the difference can be negligible. Using the Dataset API is still strongly recommended. Try to avoid the following:
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When the training speed is not critical, there's no difference; use any pipeline you feel comfortable with. When the speed is important and you have a large training set, the Dataset API seems a better choice, especially if you plan distributed computation.
The Dataset API works nicely with text data, such as CSV files; check out this section of the dataset tutorial.
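If the data fits in memory, the two tasks from the question can be prepared up front and the resulting arrays handed to Dataset.from_tensor_slices(). The sketch below shows only that preparation step, in plain Python with nested lists standing in for NumPy arrays. The function names and the sample values are illustrative, and it makes the valid index ranges of the two formulas explicit.

```python
def compute_labels(features, h):
    """labels[i] = features[i + h][-1] / features[i][-1] - 1.

    Only rows that have another row h steps ahead get a label.
    """
    return [features[i + h][-1] / features[i][-1] - 1
            for i in range(len(features) - h)]


def rolling_windows(features, window):
    """train_features[i] = features[i:i + window], full-length windows only."""
    return [features[i:i + window]
            for i in range(len(features) - window + 1)]


# Made-up rows: the last column is the one the target is computed from.
features = [[1.0, 100.0], [2.0, 110.0], [3.0, 121.0], [4.0, 133.1]]
labels = compute_labels(features, h=1)
windows = rolling_windows(features, window=2)
```

Note that the two lists generally have different lengths (len(features) - h versus len(features) - window + 1), so they still need to be aligned and trimmed to a common range before being paired up for training.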

How to speed up Gensim Word2vec model load time?

I'm building a chatbot so I need to vectorize the user's input using Word2Vec.
I'm using a pre-trained model with 3 million words by Google (GoogleNews-vectors-negative300).
So I load the model using Gensim:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
The problem is that it takes about 2 minutes to load the model. I can't let the user wait that long.
So what can I do to speed up the load time?
I thought about putting each of the 3 million words and their corresponding vector into a MongoDB database. That would certainly speed things up but intuition tells me it's not a good idea.
In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most- to least- frequent order, so the first N are usually the N-sized subset you'd want. So use limit=500000 to get the most-frequent 500,000 words' vectors – still a fairly large vocabulary – saving 5/6ths of the memory/load-time.)
So that may help a bit. But if you're re-loading for every web-request, you'll still be hurting from loading's IO-bound speed, and the redundant memory overhead of storing each re-load.
There are some tricks you can use in combination to help.
Note that after loading such vectors in their original word2vec.c-originated format, you can re-save them using gensim's native save(). If you save them uncompressed, and the backing array is large enough (and the GoogleNews set is definitely large enough), the backing array gets dumped in a separate file in a raw binary format. That file can later be memory-mapped from disk, using gensim's native load(filename, mmap='r') option.
Initially, this will make the load seem snappy – rather than reading all the array from disk, the OS will just map virtual address regions to disk data, so that some time later, when code accesses those memory locations, the necessary ranges will be read-from-disk. So far so good!
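The map-now, read-later behaviour described here can be seen with Python's standard mmap module. This is a minimal sketch of the mechanism only, not gensim's loading code. The file name and the 1,000-float array are made up as a stand-in for the saved vector array.

```python
import mmap
import os
import struct
import tempfile
from array import array

# Write 1,000 floats to a raw binary file (stand-in for the saved vector array).
values = array('f', (float(i) for i in range(1000)))
path = os.path.join(tempfile.mkdtemp(), 'vectors.raw')
with open(path, 'wb') as f:
    values.tofile(f)

# Map the file read-only. No bulk read happens here; the OS only sets up
# the mapping between virtual addresses and the file on disk.
with open(path, 'rb') as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Accessing a location is what makes the OS page that part of the file in.
(value_500,) = struct.unpack_from('f', mapped, 500 * values.itemsize)
mapped.close()
```

Creating the mapping costs about the same whatever the file size, which is why the load looks fast. The disk reads are deferred to the first access of each region.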
However, if you are doing typical operations like most_similar(), you'll still face big lags, just a little later. That's because this operation requires both an initial scan-and-calculation over all the vectors (on first call, to create unit-length-normalized vectors for every word), and then another scan-and-calculation over all the normed vectors (on every call, to find the N-most-similar vectors). Those full-scan accesses will page-into-RAM the whole array – again costing the couple-of-minutes of disk IO.
What you want is to avoid redundantly doing that unit-normalization, and to pay the IO cost just once. That requires keeping the vectors in memory for re-use by all subsequent web requests (or even multiple parallel web requests). Fortunately memory-mapping can also help here, albeit with a few extra prep steps.
First, load the word2vec.c-format vectors, with load_word2vec_format(). Then, use model.init_sims(replace=True) to force the unit-normalization, destructively in-place (clobbering the non-normalized vectors).
Then, save the model to a new filename-prefix: model.save('GoogleNews-vectors-gensim-normed.bin'). (Note that this actually creates multiple files on disk that need to be kept together for the model to be re-loaded.)
Now, we'll make a short Python program that serves to both memory-map load the vectors, and force the full array into memory. We also want this program to hang until externally terminated (keeping the mapping alive), and be careful not to re-calculate the already-normed vectors. This requires another trick because the loaded KeyedVectors actually don't know that the vectors are normed. (Usually only the raw vectors are saved, and normed versions re-calculated whenever needed.)
Roughly the following should work:
from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
model.most_similar('stuff') # any word will do: just to page all in
Semaphore(0).acquire() # just hang until process killed
This will still take a while, but only needs to be done once, before/outside any web requests. While the process is alive, the vectors stay mapped into memory. Further, unless/until there's other virtual-memory pressure, the vectors should stay loaded in memory. That's important for what's next.
Finally, in your web request-handling code, you can now just do the following:
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
# … plus whatever else you wanted to do with the model
Multiple processes can share read-only memory-mapped files. (That is, once the OS knows that file X is in RAM at a certain position, every other process that also wants a read-only mapped version of X will be directed to re-use that data, at that position.)
So this web-request load(), and any subsequent accesses, can all re-use the data that the prior process already brought into address-space and active-memory. Operations requiring similarity-calcs against every vector will still take the time to access multiple GB of RAM, and do the calculations/sorting, but will no longer require extra disk-IO and redundant re-normalization.
If the system is facing other memory pressure, ranges of the array may fall out of memory until the next read pages them back in. And if the machine lacks the RAM to ever fully load the vectors, then every scan will require a mix of paging-in-and-out, and performance will be frustratingly bad no matter what. (In such a case: get more RAM or work with a smaller vector set.)
But if you do have enough RAM, this winds up making the original/natural load-and-use-directly code "just work" in a quite fast manner, without an extra web service interface, because the machine's shared file-mapped memory functions as the service interface.
I really love vzhong's Embedding library. https://github.com/vzhong/embeddings
It stores word vectors in SQLite, which means we don't need to load the model but can just fetch the corresponding vectors from the DB :D
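Success method:
from gensim.models import KeyedVectors
# KeyedVectors is used here, as in the answer above; older gensim exposed the same loader on Word2Vec
# one-time conversion: load the word2vec-format file, unit-normalize in place, re-save natively
model = KeyedVectors.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
model.init_sims(replace=True)
model.save('bio_word')
Later, load the model memory-mapped:
model = KeyedVectors.load('bio_word', mmap='r')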
for more info: https://groups.google.com/forum/#!topic/gensim/OvWlxJOAsCo
I have that problem whenever I use the Google News dataset. The issue is that there are way more words in the dataset than you'll ever need. There is a huge number of typos and whatnot. What I do is scan the data I'm working on, build a dictionary of, say, the 50k most common words, get the vectors with Gensim and save the dictionary. Loading this dictionary takes half a second instead of 2 minutes.
If you have no specific dataset, you could use the 50 or 100k most common words from a big dataset, such as a news dataset from WMT to get you started.
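The trimming step can be sketched with the standard library alone. In the sketch below a plain dict stands in for the loaded Gensim vectors, and the function names, the tiny corpus and the vectors are all made up for illustration. With real data, k would be something like the 50k mentioned above.

```python
from collections import Counter


def most_common_words(tokens, k):
    """Return the k most frequent tokens in the corpus you actually work on."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def trim_vectors(vectors, keep_words):
    """Keep only the vectors for the chosen words, skipping unknown ones."""
    return {word: vectors[word] for word in keep_words if word in vectors}


# Tiny made-up corpus and vector table, only to show the mechanics.
tokens = 'the cat sat on the mat the cat slept'.split()
vectors = {'the': [0.1, 0.2], 'cat': [0.3, 0.4],
           'mat': [0.5, 0.6], 'dog': [0.7, 0.8]}
small_vectors = trim_vectors(vectors, most_common_words(tokens, k=2))
```

The resulting small dictionary is what would be saved and re-loaded on each start instead of the full model.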
Other options are to always keep Gensim running. You can create a FIFO for a script running Gensim. The script acts like a "server" that can read a file to which a "client" writes, watching for vector requests.
I think the most elegant solution is to run a web service providing word embeddings. Check out the word2vec API as an example. After installing, getting the embedding for "restaurant" is as simple as:
curl http://127.0.0.1:5000/word2vec/model?word=restaurant

Faceted search and heat map creation on GPU

I am trying to find ways to filter and render 100 million+ data points as a heat map in real time.
Each point in addition to the (x,y) coordinates has a fixed set of attributes (int, date, bit flags) which can be dynamically chosen by the user in order to filter down the data set.
Would it be feasible to accelerate all or parts of this task on GPUs?
It would help if you were more specific, but I'm assuming that you want to apply a user-specified filter to the same 2D spatial data. If this is the case, you could consider organizing your data into a spatial data structure, such as a Quadtree or K-d tree.
Once you have done this, you could run a GPU kernel for each region in your data structure based on the filter you want to apply. Each thread will figure out which points in its region satisfy the specified filter.
Definitely, this is the kind of problem that fits into the GPGPU spectrum.
You could decide to create your own kernel to filter your data or simply use some functions from the vendor's libraries to that end. You would probably normalize, interpolate, and so on, which are common utilities in those libraries. These kinds of functions are typically embarrassingly parallel, so it shouldn't be difficult to create your own kernel.
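To make the shape of such a kernel concrete, here is a sequential Python sketch of the filter-then-bin step. It is not GPU code, and the function name, the bit-flag filter and the sample points are assumptions made for illustration. On a GPU, each loop iteration would map to one thread, and the increment would need to be an atomic add because several threads can hit the same cell.

```python
def build_heat_map(points, required_flags, width, height, bins_x, bins_y):
    """Count the points that carry all required flag bits into a 2D grid.

    points: iterable of (x, y, flags), with 0 <= x <= width and 0 <= y <= height.
    Returns a list of bins_y rows, each holding bins_x counts.
    """
    grid = [[0] * bins_x for _ in range(bins_y)]
    for x, y, flags in points:
        # Filter: skip points that lack any of the requested bits.
        if (flags & required_flags) != required_flags:
            continue
        # Bin: scale the coordinates to cell indices, clamping the upper edge.
        col = min(int(x / width * bins_x), bins_x - 1)
        row = min(int(y / height * bins_y), bins_y - 1)
        grid[row][col] += 1
    return grid


# Made-up points on a 2 x 2 area; the filter asks for the lowest flag bit.
points = [(0.5, 0.5, 0b01), (1.5, 0.5, 0b11), (1.9, 1.9, 0b11), (0.1, 1.2, 0b10)]
heat_map = build_heat_map(points, required_flags=0b01,
                          width=2.0, height=2.0, bins_x=2, bins_y=2)
```

Every point is handled independently of the others, which is what makes the step embarrassingly parallel. Date or integer attributes would be filtered the same way with a range comparison instead of a bit test.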
I'd rather use a visualization framework that allows you to filter and visualize your data in real time. Vispy is a great option but, of course, there are some others.

Is there an efficient way to map a graph onto blocks in CUDA programming?

In parallel computing, the first step is usually to divide the original problem into sub-tasks and map them onto blocks and threads.
For problems with a regular data structure, such as matrix multiplication, FFT and so on, this is easy and efficient.
But graph theory problems like shortest path, graph traversal and tree search have irregular data structures. It seems hard, at least to me, to partition such a problem onto blocks and threads when using a GPU.
I am wondering if there are efficient solutions for this kind of partitioning?
For simplicity, take the single-source shortest-path problem as an example. I am stuck on how to divide the graph so that both locality and coalescing are achieved.
The tree data structure is designed to best suit sequential progression. In tree search, since each state is highly dependent on the previous state, I think it would not be optimal to parallelize traversal on a tree.
As far as the graph is concerned, each connected node can be analyzed in parallel, but I guess there might be redundant operations for overlapping paths.
You can use MapGraph, which uses the GAS method for all the things you mentioned. They also have some examples implemented for this, and the library includes Gather, Apply and Scatter in CUDA for the GPU as well as a CPU-only implementation.
You can find the latest version here: http://sourceforge.net/projects/mpgraph/files/0.3.3/
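To give an idea of what the Gather, Apply, Scatter pattern looks like for the single-source shortest-path example from the question, here is a small sequential Python sketch. It is not MapGraph's API: the function name and the edge list are made up, and it assumes non-negative edge weights. On a GPU, each of the three phases would be a data-parallel pass over the currently active vertices.

```python
import math


def sssp_gas(num_vertices, edges, source):
    """Single-source shortest paths written as Gather / Apply / Scatter rounds.

    edges: list of (src, dst, weight) with non-negative weights.
    """
    in_edges = [[] for _ in range(num_vertices)]
    out_neighbors = [[] for _ in range(num_vertices)]
    for src, dst, weight in edges:
        in_edges[dst].append((src, weight))
        out_neighbors[src].append(dst)

    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    active = set(out_neighbors[source])
    while active:
        # Gather: each active vertex reads the distances of its in-neighbours.
        gathered = {v: min(dist[u] + w for u, w in in_edges[v]) for v in active}
        # Apply: keep the gathered value only where it improves the distance.
        changed = [v for v, d in gathered.items() if d < dist[v]]
        for v in changed:
            dist[v] = gathered[v]
        # Scatter: vertices that changed activate their out-neighbours.
        active = {n for v in changed for n in out_neighbors[v]}
    return dist


# Made-up graph: 0 -> 1 (4), 0 -> 2 (1), 2 -> 1 (2), 1 -> 3 (1)
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
distances = sssp_gas(4, edges, source=0)
```

The work in each round is proportional to the active set rather than to the whole graph, and the partitioning question becomes one of how the active vertices and their edge lists are laid out in memory.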