Saving Pool with categorical data (Python package) - catboost

I have a dataset primarily composed of categorical features. To support our ML workflow, including hyperparameter tuning, I'd like to save the quantized Pools. It appears this is not supported (see below). What is the best practice approach to snapshotting Pools for reuse when they contain categorical features?
train_pool.quantize()
train_pool.save("train_pool.bin")
uncaught exception:
address -> 0x11000a800
what() -> "catboost/libs/quantized_pool/serialization.cpp:960: Saving quantization results is supported only for numerical features"
type -> TCatBoostException
Aborted
I'm using CatBoost 0.17.4.

PyTorch static quantization: different training (calibration) and inference backends

Can we use a different CPU architecture (and backend) for training (calibration) and for inference of a quantized PyTorch model?
The only post on this subject that I've found states:
static quantization must be performed on a machine with the same
architecture as your deployment target. If you are using FBGEMM, you
must perform the calibration pass on an x86 CPU; if you are using
QNNPACK, calibration needs to happen on an ARM CPU
But there is nothing about this in the official documentation.
The information in the link you posted is correct: you should use the same backend in both cases. This is also mentioned in the official documentation:
"When preparing a quantized model, it is necessary to ensure that qconfig and the engine used for quantized computations match the backend on which the model will be executed."
Source: https://pytorch.org/docs/stable/quantization.html
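As a rough illustration of the rule quoted above, the sketch below maps a CPU architecture string to the backend the quoted post associates with it (FBGEMM for x86, QNNPACK for ARM) and checks whether a calibration machine and a deployment target agree. The helper names `backend_for_machine` and `backends_match` and the list of architecture strings are assumptions made for this example; they are not part of PyTorch.

```python
import platform

# Mapping taken from the quoted post: FBGEMM targets x86, QNNPACK targets ARM.
# The architecture spellings below are an assumption, not an exhaustive list.
_X86_NAMES = {"x86_64", "amd64", "i386", "i686", "x86"}
_ARM_PREFIXES = ("arm", "aarch64")


def backend_for_machine(machine: str) -> str:
    """Return the quantization backend name for an architecture string (hypothetical helper)."""
    name = machine.lower()
    if name in _X86_NAMES:
        return "fbgemm"
    if name.startswith(_ARM_PREFIXES):
        return "qnnpack"
    raise ValueError(f"no known quantized backend for architecture {machine!r}")


def backends_match(calibration_machine: str, deployment_machine: str) -> bool:
    """True when calibration and deployment would use the same backend."""
    return backend_for_machine(calibration_machine) == backend_for_machine(deployment_machine)


# Architecture of the machine running this script, as reported by the standard library.
local_machine = platform.machine()

# Example: calibrating on an x86 workstation for an ARM device does not match.
mismatch = not backends_match("x86_64", "aarch64")
```

In PyTorch itself, the chosen backend string is what you would assign to `torch.backends.quantized.engine` and pass to `torch.quantization.get_default_qconfig` before the calibration pass.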

NIfTI vs DICOM for 3D volumetric data

Are there major benefits of selecting NIfTI over DICOM (or vice versa) as the choice of data format? I am working on 3D volumetric semantic segmentation. I will have to convert either format to a NumPy array or tensor before feeding it to the network, but I am curious about the performance implications of the choice.
(This question risks being opinion-based, so trying to stick to facts.)
DICOM is a very powerful, flexible but complex format, and its strength is to provide interoperability between different hardware and software. However, DICOM is not particularly efficient for image processing and analysis. One potential drawback of DICOM is that a single volume is stored as a sequence of 2D slices, which can be cumbersome to deal with.
NIfTI is an improved version of the Analyze file format, which was designed to be simpler than DICOM while still retaining all the essential metadata. It has the added benefit of being able to store a volume in a single file, with a simple header followed by raw data. This makes it fast to load and process.
There are several other medical file formats suitable for this task. You may also wish to consider NRRD, which has many features in common with NIfTI: a simple format that is fast to parse and load, with flexible storage encoding for 2D, 3D and 4D data. Many tools and libraries can process NRRD files too.
So given that your primary need is efficient storage and analysis, NIfTI or NRRD would be a better choice.
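To make the "simple header followed by raw data" point concrete, here is a minimal sketch that builds a tiny single-file NIfTI-1 image in memory and reads the volume back using only the standard library. The function names are hypothetical, only the header fields the reader needs are filled in, and the example assumes little-endian, unsigned 8-bit voxels; real code would normally use an existing library such as nibabel rather than parsing the header by hand.

```python
import math
import struct

HEADER_SIZE = 348  # fixed size of a NIfTI-1 header in bytes


def build_minimal_nifti(shape, voxels):
    """Build a tiny single-file NIfTI-1 image as bytes (uint8 voxels, little-endian)."""
    header = bytearray(HEADER_SIZE)
    struct.pack_into("<i", header, 0, HEADER_SIZE)           # sizeof_hdr
    dim = [len(shape)] + list(shape) + [1] * (7 - len(shape))
    struct.pack_into("<8h", header, 40, *dim)                # dim[0] holds the number of dimensions
    struct.pack_into("<h", header, 70, 2)                    # datatype 2 = unsigned 8-bit
    struct.pack_into("<h", header, 72, 8)                    # bitpix: bits per voxel
    struct.pack_into("<f", header, 108, 352.0)               # vox_offset: where the raw data starts
    header[344:348] = b"n+1\x00"                             # magic string for single-file NIfTI-1
    # Four extension-flag bytes sit between the header and the voxel data.
    return bytes(header) + b"\x00" * 4 + bytes(voxels)


def read_nifti_volume(blob):
    """Return (shape, raw voxel bytes) from the in-memory file built above."""
    if struct.unpack_from("<i", blob, 0)[0] != HEADER_SIZE:
        raise ValueError("not a little-endian NIfTI-1 header")
    dim = struct.unpack_from("<8h", blob, 40)
    shape = tuple(dim[1:1 + dim[0]])
    datatype = struct.unpack_from("<h", blob, 70)[0]
    if datatype != 2:
        raise ValueError("this sketch only handles unsigned 8-bit voxels")
    vox_offset = int(struct.unpack_from("<f", blob, 108)[0])
    count = math.prod(shape)
    return shape, blob[vox_offset:vox_offset + count]


blob = build_minimal_nifti((2, 3, 4), range(24))
shape, data = read_nifti_volume(blob)
```

The whole volume comes back from one contiguous read, whereas a DICOM series would first need its 2D slices collected and ordered.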

Sagemaker model evaluation

The Amazon documentation lists several approaches to evaluating a model (e.g. cross-validation); however, these methods do not seem to be available in the SageMaker Java SDK.
Currently, if we want to do 5-fold cross-validation, it seems the only option is to create 5 models (and also deploy 5 endpoints), one model for each subset of data, and manually compute the performance metrics (recall, precision, etc.).
This approach is not very efficient and can also be expensive, because we need to deploy k endpoints, one per fold of the k-fold validation.
Is there another way to test the performance of a model?
Amazon SageMaker is a set of components, and you can choose which of them to use.
The built-in algorithms are designed for (infinite) scale, which means that you can have huge datasets and be able to build a model with them quickly and with low cost. Once you have large datasets you usually don't need to use techniques such as cross-validation, and the recommendation is to have a clear split between training data and validation data. Each of these parts will be defined with an input channel when you are submitting a training job.
If you have a small amount of data and you want to train on all of it, using cross-validation to make that possible, you can use a different part of the service (an interactive notebook instance). You can bring your own algorithm or even your own container image to be used in development, training or hosting. You can run any Python code based on any machine learning library or framework, including scikit-learn, R, TensorFlow, MXNet, etc. In your code, you can define cross-validation based on the training data that you copy from S3 to the worker instances.
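A minimal sketch of what "define cross-validation in your code" could look like: k-fold splitting and scoring run inside a single script, so no per-fold endpoints are needed. Everything here is illustrative; `k_fold_indices`, `cross_validate`, and the majority-class stand-in model are hypothetical names, and in practice the `fit` and `score` callables would wrap whichever library you bring.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(features, labels, fit, score, k=5):
    """Train and score one model per fold; return the mean score and the per-fold scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        scores.append(score(model, [features[i] for i in val_idx], [labels[i] for i in val_idx]))
    return sum(scores) / len(scores), scores


# Stand-in "model" for the example: always predict the most common training label.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)


def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)


features = list(range(20))
labels = [1] * 15 + [0] * 5
mean_accuracy, fold_accuracies = cross_validate(features, labels, fit_majority, accuracy, k=5)
```

Each fold's model exists only in memory for the duration of the loop, which is the main cost difference from deploying one endpoint per fold.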

When to use tensorflow datasets api versus pandas or numpy

There are a number of guides I've seen on using LSTMs for time series in tensorflow, but I am still unsure about the current best practices in terms of reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.
In my situation I have a file data.csv with my features, and would like to do the following two tasks:
1. Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,
labels[i] = features[i + h, -1] / features[i, -1] - 1
I would like h to be a parameter here, so I can experiment with different horizons.
2. Get rolling windows - for training purposes, I need to roll my features into windows of length window:
train_features[i] = features[i: i + window]
I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
Edit: I guess I'd also like to know whether the two tasks I listed are suited to the Dataset API, or whether I'm better off using other libraries for them.
First off, note that you can use the Dataset API with pandas or NumPy arrays, as described in the tutorial:
If all of your input data fit in memory, the simplest way to create a
Dataset from them is to convert them to tf.Tensor objects and use
Dataset.from_tensor_slices()
A more interesting question is whether you should organize the data pipeline with a session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":
While feeding data using a feed_dict offers a high level of
flexibility, in most instances using feed_dict does not scale
optimally. However, in instances where only a single GPU is being used
the difference can be negligible. Using the Dataset API is still
strongly recommended. Try to avoid the following:
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
But, as they say themselves, the difference may be negligible, and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, there's no practical difference, so use whichever pipeline you feel comfortable with. When speed is important and you have a large training set, the Dataset API seems the better choice, especially if you plan distributed computation.
The Dataset API also works nicely with text data such as CSV files; check out the corresponding section of the dataset tutorial.
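For the two tasks in the question, a minimal NumPy sketch could look like the following. The function names `make_labels` and `make_windows` are hypothetical, and the tiny `features` array stands in for the contents of data.csv; the formulas are the ones given in the question.

```python
import numpy as np


def make_labels(features, h):
    """labels[i] = features[i + h, -1] / features[i, -1] - 1, for a horizon h >= 1."""
    last = features[:, -1]
    return last[h:] / last[:-h] - 1.0


def make_windows(features, window):
    """train_features[i] = features[i: i + window], stacked into one 3D array."""
    n_windows = len(features) - window + 1
    return np.stack([features[i:i + window] for i in range(n_windows)])


# Stand-in for the CSV contents: 6 time steps, 2 feature columns.
features = np.arange(1, 13, dtype=float).reshape(6, 2)

labels = make_labels(features, h=2)          # one label per step that has a value h steps ahead
windows = make_windows(features, window=3)   # shape (n_windows, window, n_features)
```

Because both arrays fit in memory here, they could be handed to `Dataset.from_tensor_slices()` as the quoted tutorial describes. How each window is paired with a label (for example, using the label at the window's last step) is a modelling choice the question leaves open, so it is not fixed in this sketch.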

Is it possible to create and execute a deep learning model and do its prediction in C++?

Let's suppose that my chip doesn't support any API like Keras, TensorFlow or sklearn; however, I need to implement a deep learning model in Python.
Is it possible to train and test my model in Python and then call the best resulting model for prediction from C++?
Where must I save the resulting best model so that it can be called in the next steps? Must I save it on the chip? Do I need to install TensorFlow and Keras on my chip in this case?
TERMINOLOGY
You seem to be confused about terminology. Here's a somewhat simplified overview.
Your chip is the hardware (CPU or GPU), and will include circuitry to support its instruction set (move data to/from local memory, perform math and logic operations, etc.). A CPU/GPU chip that cannot support your ML software is hard to visualize, and would not support Python or C++, either. The chip comes on a board, which includes a lot of peripheral connections, secondary memory, etc.
Then your operating system (basic software) is installed on the hardware. This OS manages resources: jobs, processes, memory allocation, etc. If there's a failure in support, it would be here, not in the chip. Finally, you install your desired applications (software tools, programs, etc.) as additions to the OS.
C++ and Python are two popular high-level languages. Both can be used with TensorFlow and Keras (machine learning frameworks) and scikit-learn (a scientific / statistical package; sklearn is the package name you import).
DIRECT ANSWER
Yes, you can write your NN in Python. Yes, you can call it from C++. Python depends on C/C++ libraries; there is a viable interface between the two.
There is no particular method you must use to save your model and call it later: if you're writing your own model in Python, you get to decide the storage format and location. All you need is for your Python and C++ programs to "agree" on the format. Since you're writing them both, you can choose whatever works for you.
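As one example of such an agreed format, the sketch below writes the weights of a single dense layer to a binary file: two little-endian 32-bit integers (rows, columns) followed by the values as 32-bit floats, row by row. The layout and the function names are assumptions made up for this illustration; a C++ program could read the same file by reading two `int32_t` values and then `rows * cols` `float` values on a little-endian machine.

```python
import os
import struct
import tempfile


def save_dense_layer(path, weights):
    """Write a weight matrix as: int32 rows, int32 cols, then rows*cols float32 values."""
    rows, cols = len(weights), len(weights[0])
    with open(path, "wb") as f:
        f.write(struct.pack("<ii", rows, cols))
        for row in weights:
            f.write(struct.pack(f"<{cols}f", *row))


def load_dense_layer(path):
    """Read the matrix back; mirrors what the C++ side would do."""
    with open(path, "rb") as f:
        rows, cols = struct.unpack("<ii", f.read(8))
        return [list(struct.unpack(f"<{cols}f", f.read(4 * cols))) for _ in range(rows)]


# Round trip through a temporary file (values chosen to be exact in float32).
weights = [[0.5, -1.25, 2.0], [3.0, 0.0, -0.75]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "layer0.bin")
    save_dense_layer(path, weights)
    file_size = os.path.getsize(path)
    restored = load_dense_layer(path)
```

Fixing the byte order and element width explicitly is what lets two programs written in different languages read the same file without guessing.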
RECOMMENDATION
Don't write these yourself, unless you really want the exercise. Instead, install a framework (TensorFlow, Caffe, Neon, Torch, MXNet, Keras, ...). Then, simply follow the given tutorials to learn how to build, save, and restore your model.