When to use the TensorFlow Dataset API versus pandas or NumPy - CSV

There are a number of guides I've seen on using LSTMs for time series in TensorFlow, but I am still unsure about current best practices for reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.
In my situation I have a file data.csv with my features, and would like to do the following two tasks:
1. Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,
labels[i] = features[i + h, -1] / features[i, -1] - 1
I would like h to be a parameter here, so I can experiment with different horizons.
2. Get rolling windows - for training purposes, I need to roll my features into windows of length window:
train_features[i] = features[i : i + window]
I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
Edit: I guess I'd also like to know whether the two tasks I listed are suited to the Dataset API, or if I'm better off using other libraries to deal with them.

First off, note that you can use the Dataset API with pandas or NumPy arrays, as described in the tutorial:
If all of your input data fit in memory, the simplest way to create a
Dataset from them is to convert them to tf.Tensor objects and use
Dataset.from_tensor_slices()
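For the two tasks in the question, a minimal sketch might look like the following, assuming features is already a 2-D NumPy array (e.g. loaded with pandas) and that each window is paired with the label at its last time step - the alignment is my assumption, not something stated in the question:
import numpy as np
import tensorflow as tf

# Sketch only: assume `features` is a 2-D NumPy array loaded from data.csv,
# e.g. features = pd.read_csv("data.csv").values
h = 5          # prediction horizon (a tunable parameter, as in the question)
window = 30    # length of each rolling window

# Task 1: percent-change targets at horizon h, computed from the last column
labels = features[h:, -1] / features[:-h, -1] - 1          # length N - h

# Task 2: rolling windows of length `window`, each paired (by assumption)
# with the label at its last time step
n_samples = len(features) - h - window + 1
train_features = np.stack([features[i:i + window] for i in range(n_samples)])
train_labels = labels[window - 1:window - 1 + n_samples]

# Hand the in-memory arrays to the Dataset API, then shuffle and batch
dataset = tf.data.Dataset.from_tensor_slices((train_features, train_labels))
dataset = dataset.shuffle(buffer_size=1000).batch(32)
From here, dataset can be iterated during training instead of building feed_dict batches by hand.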
A more interesting question is whether you should organize the data pipeline with session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":
While feeding data using a feed_dict offers a high level of
flexibility, in most instances using feed_dict does not scale
optimally. However, in instances where only a single GPU is being used
the difference can be negligible. Using the Dataset API is still
strongly recommended. Try to avoid the following:
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, there's no difference: use whichever pipeline you feel comfortable with. When speed is important and you have a large training set, the Dataset API seems the better choice, especially if you plan distributed computation.
The Dataset API also works nicely with text data, such as CSV files; check out this section of the dataset tutorial.
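For instance, a rough sketch of streaming data.csv directly through the Dataset API might look like this (TF 1.x-era names; in newer releases the same function lives at tf.io.decode_csv, and the column defaults below are just placeholders):
import tensorflow as tf

# Rough sketch: stream data.csv through the Dataset API line by line.
# record_defaults assumes four float columns - adjust it to your file.
record_defaults = [[0.0], [0.0], [0.0], [0.0]]

def parse_line(line):
    fields = tf.decode_csv(line, record_defaults=record_defaults)
    return tf.stack(fields)

dataset = (tf.data.TextLineDataset("data.csv")
           .skip(1)          # skip the header row, if there is one
           .map(parse_line)
           .batch(32))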

Related

Why should I not use collect() in my Python Transforms?

TL;DR: I hear rumors that certain PySpark functions aren't advisable in Transforms, but I'm not sure which functions they are or why.
Why can't I just collect() my data to a list in certain circumstances and iterate over the rows?
There are a lot of pieces here one needs to understand to arrive at the final conclusion, namely that collect() and other such functions are inefficient uses of Spark.
Local vs. Distributed
First, let's cover the difference between local and distributed computation. In Spark, the pyspark.sql.functions and pyspark.sql.DataFrame operations you typically execute, such as join() or groupBy(), delegate execution to the underlying Spark libraries for maximum possible performance. Think of this as using Python simply as a more convenient language on top of SQL, where you are lazily describing the operations you want Spark to go do for you.
In this way, when you stick to SQL operations in PySpark, you can expect highly scalable performance, but only for things you can express in SQL. This is where people can typically take a lazy approach and implement their transformations using for loops instead of thinking about the best possible tactics.
Let's consider the case where you simply want to add a single value to an integer column in your DataFrame. On Stack Overflow and elsewhere you'll find plenty of examples, in some more subtle cases, that suggest using collect() to bring the data into a Python list, looping over every row, and pushing the data back into a DataFrame when finished - and that is one tactic you could use here. Think about what it means in practice, though: you are bringing your data, which is hosted in Spark, back to the driver of your build, looping over each row using a single thread in Python, and adding a constant value to each row one at a time. If we instead found the (obvious in this case) SQL equivalent of this operation, Spark could take your data and add the value to individual rows in massively parallel fashion. Namely, if you have 64 executors (instances of workers available to do the work of your job), then you'll have 64 'cores' (not a perfect analogy, but close) across which the data is split, each adding the value to its share of the column. This lets you perform the operation you wanted dramatically faster.
Doing work on the driver is what I refer to as 'local' computation, and work in the executors as 'distributed'.
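To make the contrast concrete, here is a hedged sketch of both tactics for that add-a-constant example; df and the column name "amount" are hypothetical:
from pyspark.sql import functions as F

# Anti-pattern: pull every row back to the driver and loop in Python
rows = df.collect()                               # forces execution; single-threaded from here on
new_values = [row["amount"] + 1 for row in rows]  # one row at a time on the driver
# ...and the data still has to be pushed back into a DataFrame by hand

# Preferred: describe the operation in SQL terms and let Spark parallelize it
df_plus_one = df.withColumn("amount", F.col("amount") + F.lit(1))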
This may be an obvious example, but it is often tough to remember this difference when dealing with more difficult transformations such as advanced windowing operations or linear algebra computations. Spark has libraries available to do matrix multiplications and manipulations in a distributed fashion, as well as some fairly advanced windowing operations that require a bit more thinking about your problem first.
Lazy evaluation
The most effective way to use PySpark is to dispatch your 'instructions' on how to build your DataFrame all at once, so that Spark can figure out the best way to materialize the data. In this way, functions that force the computation of a DataFrame so you can inspect it at some point in your code should be avoided if at all possible; they mean Spark is doing extra work to satisfy your print() statement or other method call instead of working towards writing out your data.
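For example (again with a hypothetical df and column names), a whole chain of transformations costs nothing until a single action, such as writing the output, forces Spark to plan and execute it all at once:
from pyspark.sql import functions as F

# Nothing is computed here - Spark only records the plan
result = (df
          .filter(F.col("amount") > 0)
          .withColumn("amount_doubled", F.col("amount") * 2)
          .select("id", "amount_doubled"))

# The write is the single action that triggers execution of the whole plan
result.write.parquet("/tmp/output")   # hypothetical output path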
Python in Java in Scala
The Python runtime talks to a JVM process, which in turn hosts the Spark runtime, written in Scala. So, for every call to collect() where you wish to materialize your data in Python, Spark must materialize your data into a single, locally-available DataFrame, convert it from Scala objects to their Java equivalents, and finally serialize it from the JVM over to Python objects before it is available to iterate over. This is an incredibly inefficient process that isn't possible to parallelize.
As a result, operations that render your data back to Python are highly advisable to avoid.
Functions to avoid
So, what functions should you avoid?
collect
head
take
first
show
Each of these methods will force execution on the DataFrame and bring the results back to the Python runtime for display / use. This means Spark won't have the opportunity to lazily figure out the most efficient way to compute upon your data and will instead be forced to bring back the data requested before proceeding with any other execution.

NIfTI vs DICOM for 3D volumetric data

Are there major benefits to selecting NIfTI over DICOM (or vice versa) as the data format? I am working on 3D volumetric semantic segmentation. I will have to convert either format to a NumPy array or tensor before feeding it to the network, but I am curious about the performance implications of the choice.
(This question risks being opinion-based, so trying to stick to facts.)
DICOM is a very powerful, flexible but complex format, and its strength is to provide interoperability between different hardware and software. However, DICOM is not particularly efficient for image processing and analysis. One potential drawback of DICOM is that a single volume is stored as a sequence of 2D slices, which can be cumbersome to deal with.
NIfTI is an improved version of the Analyze file format, which was designed to be simpler than DICOM while still retaining all the essential metadata. It has the added benefit of being able to store a volume in a single file, with a simple header followed by raw data, which makes it fast to load and process.
There are several other medical file formats suitable for this task. You may also wish to consider NRRD, which has many features in common with NIfTI: a simple format, fast to parse and load, with flexible storage encoding for 2-, 3-, and 4-D data. Many tools and libraries can process NRRD files too.
So, given that your primary need is efficient storage and analysis, NIfTI or NRRD would be the better choice.
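As a quick illustration of why the single-file layout is convenient, here is a minimal sketch of reading a NIfTI volume into a NumPy array, assuming the nibabel package and a placeholder file name:
import nibabel as nib

# Load a single-file NIfTI volume (hypothetical path) and pull out the voxel data
img = nib.load("scan.nii.gz")
volume = img.get_fdata()   # 3-D (or 4-D) NumPy array of voxel intensities
affine = img.affine        # voxel-to-world transform kept in the header
print(volume.shape, volume.dtype)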

How to input additional data into the network when high-level APIs are being used in MXNet

I am studying the MXNet framework and I need to input a matrix into the network during every iteration. The matrix is stored in external memory; it is not the training data, and it is updated by the output of the network at the end of each iteration. During each iteration, the matrix must be fed into the network.
If I use the high-level API, i.e.
model = mx.mod.Module(context=ctx, symbol=sym)
... ...
model.fit(train_data_iter, begin_epoch=begin_epoch,
          end_epoch=end_epoch, ......)
Can this be implemented?
model.fit() doesn't provide the functionality that you're looking for. However, what you want to achieve is extremely easy to do with the Gluon API of Apache MXNet. With Gluon, you write a short training loop of about seven lines rather than calling a single model.fit(). This is a typical training loop:
for epoch in range(10):
    for data, label in train_data:
        # forward + backward
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(batch_size)  # update parameters
So if you want to feed the output of your network back into the input, you can easily achieve that; see the sketch below. To get started with Gluon, I recommend the 60-minute Gluon Crash Course. To become an expert in Gluon, I recommend the Deep Learning - The Straight Dope book as well as the comprehensive set of tutorials on the main MXNet website: http://mxnet.apache.org/tutorials/index.html
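As a hedged sketch of the original question - feeding an external matrix into the network each iteration and updating it from the output - one possible Gluon structure is shown below; the block definition, the concatenation, and the update rule are illustrative assumptions, not the only way to do it:
import mxnet as mx
from mxnet import autograd, gluon, nd

class NetWithMatrix(gluon.Block):
    """Toy network whose forward pass takes the batch plus an external matrix."""
    def __init__(self, **kwargs):
        super(NetWithMatrix, self).__init__(**kwargs)
        self.dense = gluon.nn.Dense(10)

    def forward(self, x, extra):
        # concatenating the matrix onto the features is just an illustrative choice
        return self.dense(nd.concat(x.flatten(), extra, dim=1))

batch_size = 32
net = NetWithMatrix()
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

extra_matrix = nd.zeros((batch_size, 10))   # the externally stored matrix

for epoch in range(10):
    for data, label in train_data:          # train_data: the usual Gluon DataLoader
        with autograd.record():
            output = net(data, extra_matrix)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(batch_size)
        extra_matrix = output.detach()      # update the matrix from the network output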

SageMaker model evaluation

The Amazon documentation lists several approaches to evaluate a model (e.g. cross-validation), but these methods do not seem to be available in the SageMaker Java SDK.
Currently, if we want to do 5-fold cross-validation, it seems the only option is to create 5 models (and also deploy 5 endpoints), one model for each subset of the data, and manually compute the performance metrics (recall, precision, etc.).
This approach is not very efficient and can also be expensive, since it requires deploying k endpoints, one per fold of the k-fold validation.
Is there another way to test the performance of a model?
Amazon SageMaker is a set of components, and you can choose which ones to use.
The built-in algorithms are designed for (practically) infinite scale, which means that you can build a model from huge datasets quickly and at low cost. Once you have large datasets you usually don't need techniques such as cross-validation, and the recommendation is to have a clear split between training data and validation data. Each of these parts is defined as an input channel when you submit a training job.
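For illustration, with the SageMaker Python SDK (the Java SDK exposes the same concept), channel definition looks roughly like this; the container image, role and S3 paths are placeholders:
from sagemaker.estimator import Estimator

# All names here are placeholders: the training image, IAM role and S3 paths
estimator = Estimator(image_uri="<training-image-uri>",
                      role="<execution-role-arn>",
                      instance_count=1,
                      instance_type="ml.m5.xlarge",
                      output_path="s3://my-bucket/output/")

# Each part of the train/validation split becomes an input channel of the training job
estimator.fit({"train": "s3://my-bucket/train/",
               "validation": "s3://my-bucket/validation/"})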
If you have a small amount of data and you want to train on all of it and use cross-validation, you can use a different part of the service (an interactive notebook instance). You can bring your own algorithm, or even a container image, to be used for development, training or hosting. You can run any Python code based on any machine learning library or framework, including scikit-learn, R, TensorFlow, MXNet, etc. In your code, you can define cross-validation based on the training data that you copy from S3 to the worker instances.
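For example, on a notebook instance you could run something like the following scikit-learn sketch against training data copied down from S3; the estimator, file name and label column are placeholders:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical: training data already copied from S3 to the notebook instance
df = pd.read_csv("train.csv")
X, y = df.drop(columns=["label"]), df["label"]

# 5-fold cross-validation entirely inside your own code - no extra endpoints needed
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="recall")
print(scores.mean(), scores.std())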

Goal Seek in Octave to replicate Excel's 'Solver' Macro

This is essentially a question about fundamentals, and whether there is a more efficient way to achieve what I am looking for. I have built a working fluid dynamics calculator in Excel to find the flow rates required for a target pressure loss; the optimisation is handled using Solver, but it's very clunky and not user-friendly.
I'm trying to replicate the functionality in Octave since it's widely used here, but I am a complete beginner, so I'm probably missing something obvious. I can easily enter all of the maths for a single iteration via a series of functions, but my Excel file relies on the 'Solver' macro, and I'm unsure how to efficiently replicate this in Octave.
I am aware that linprog (in MATLAB) and glpk (in Octave) can be used to solve systems of linear equations.
I have a series of nested equations which are all dependent on a single matrix, Q (flow rates at various locations). Many other inputs are required, but they either remain constant throughout the calculation (e.g. system geometry) or are dictated by Q (e.g. Reynolds number and loss coefficients). In trying to simplify my problem I have settled on two steps:
1. Write code to solve my problem; input: Q matrix, output: pressure loss matrix.
2. Create a loop that iterates over different Q matrices until some conditions on the pressure loss matrix are met.
I don't think it will be practical to get my expressions into the form A*x = B (in order to use glpk) given the complexity. In Excel, I can point Solver at a Q value that drives a multitude of equations affecting pressure loss, and it will find the value I need to achieve a target. How can I most efficiently replicate this functionality in Octave?
First of all, Solver is not a macro. Pretty far from it.
So, you're going to replicate a comprehensive "What-If" Analysis plug-in -- one so complex, in fact, that Microsoft chose to contract a third-party company of experts to develop the tool and provide support for it (successfully, based on the 1.2 billion copies they've distributed).
And you're going to do this in an inferior coding language that you're a complete beginner with? Cool. I'd like to see this!
Here's a checklist of Solver's features, so you don't miss anything:
Good Luck!
More Information:
Wikipedia : Solver
Office.com : Define and Solve a Problem by using Solver
Frontline: Official Solver Page: http://solver.com
AppSource.Microsoft.com : Solver (with Video)
Frontline: Solver International Magazine