What is the best way to read large datasets with less latency for torch model? - deep-learning

I am used to the data-specific layers and tools of the libraries I normally work with, such as MXNet (RecordIO) and Caffe (LMDB); they also handle data augmentation tricks. I decided to give Torch a shot to see what is good or bad about it. At first glance, there is no such data loading standard for Torch, or I am missing it. Could you point me to some good ways to load large-scale datasets for my model training?

Some pointers:
twitter/torch-dataset
fb.resnet.torch/dataloader
Element-Research/dataload
More resources can also be found in the torch7/wiki.
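If you end up working in PyTorch rather than Lua Torch, the equivalent standard is torch.utils.data.Dataset plus DataLoader with background worker processes. Below is a minimal sketch of that pattern; the dataset class and file list are illustrative placeholders, not a Torch7 API.
import torch
from torch.utils.data import Dataset, DataLoader

class LazyFileDataset(Dataset):
    # Loads one sample from disk per __getitem__ call, so the full dataset
    # never has to fit in memory; data augmentation would also go here.
    def __init__(self, sample_paths):
        self.sample_paths = sample_paths
    def __len__(self):
        return len(self.sample_paths)
    def __getitem__(self, idx):
        # Placeholder assumption: one (input, label) tensor pair saved per file.
        x, y = torch.load(self.sample_paths[idx])
        return x, y

sample_paths = ["sample_0.pt", "sample_1.pt"]  # illustrative file list
loader = DataLoader(LazyFileDataset(sample_paths), batch_size=32,
                    shuffle=True, num_workers=4, pin_memory=True)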

Related

Is there a way to compress a NN in PyTorch?

Assume I have a trained torch.nn.Module model, and from now on I only need it for evaluation.
Is PyTorch passing my data through all the layers or is it compressing the model so that it only calculates an equivalent function?
If not, is there a way to do it in order to make the calculation faster and the model lighter in memory terms?
I have been looking on the internet for a similar question and didn't find any suitable answer.
You should do two things during inference: set your model to evaluation mode with model.eval() and wrap your code in torch.no_grad(), which disables gradient computation. This will make your code faster and more memory-efficient.
In practice this will look like
model.eval()
with torch.no_grad():
    # your inference code here
There are many options and it depends on your specific case.
One option is to convert to TorchScript.
Another option is to do quantization on the model.
Third, you could perform knowledge distillation, transferring the knowledge from your existing model to a smaller one.
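As a rough illustration of the first two options, here is a minimal sketch, assuming a CPU model whose layers are supported by TorchScript tracing and dynamic quantization; the input shape is an arbitrary placeholder.
import torch

model.eval()  # `model` is your trained torch.nn.Module

# Option 1: compile to TorchScript, so inference no longer depends on the Python class definition.
example_input = torch.randn(1, 3, 224, 224)  # placeholder input shape
scripted = torch.jit.trace(model, example_input)
scripted.save("model_scripted.pt")

# Option 2: dynamic quantization, storing e.g. Linear weights as int8 to shrink the model.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)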

What is an efficient way to get compressed ML models in multiple sizes from one model, for a non-ML expert?

I'm not an ML expert and know just a little background about it.
I know that there are techniques to reduce the size of neural networks like distillation and pruning. But I don't know how to efficiently perform those techniques.
Now I need to solve a quite practical problem. I'd like to ship FaceNet, the face recognition model, to mobile devices. There might be trade-offs between recognition accuracy and performance + size. I don't know which model size would best fit my demands, so I think I need to test models of many sizes and figure out the best one empirically. To do this, I would need to acquire models of many sizes obtained by compressing the pre-trained FaceNet available on its website, for example a 30 MB version of FaceNet, a 40 MB version, etc.
However, model compression is not free and costs money. I'm worried that I'd do something very stupid and expensive. What is the recommended way to do this? Does this problem have any common solution for a non-ML expert?
Network "compression" is not like a slider in which you can move anywhere from slow/accurate to fast/not-accurate.
From your starting model you can apply some techniques which may or may not reduce the size of the network and may or may not reduce also the accuracy. For example converting weights from float32 to float16 will for sure reduce the memory requirement of the network by half, but can also introduce a small reduction in accuracy and it's not supported on all devices.
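In PyTorch, for instance, that float16 conversion is essentially a one-liner. A minimal sketch, assuming a GPU (or other hardware) with half-precision support and an arbitrary input shape:
import torch

# `model` is the trained torch.nn.Module; half() casts parameters and buffers
# from float32 to float16, halving their memory footprint.
model = model.half().cuda().eval()
with torch.no_grad():
    x = torch.randn(1, 3, 160, 160, device="cuda", dtype=torch.float16)  # placeholder input
    output = model(x)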
My suggestion is: start with the base model and perform some tests with it. Understand how far you are from your application's target FPS and size, and then decide which approach makes more sense to reach the objective you have in mind.
Since you said you're not an ML expert, I think this kind of task (at least to my knowledge) requires some studying of the subject and some experimentation, and I would start with an open-source solution such as https://tvm.apache.org/.

How to input audio data into deep learning algorithm?

I'm very new to deep learning, and I'm aiming to use a GAN (Generative Adversarial Network) to recognize emotional speech. I have only seen images used as inputs to most deep learning algorithms, such as GANs, but I'm curious how audio data can be fed into them, other than using spectrogram images as the input. Also, I'd appreciate it if you could explain it in layman's terms.
Audio data can be represented as NumPy arrays, but before moving to that you should understand what audio really is. If you think about what audio looks like, it is nothing but a wave, where the amplitude changes with respect to time.
Assuming the audio is represented in the time domain, we sample its amplitude at fixed intervals; the number of samples taken per second is called the sampling rate.
Converting the data into the frequency domain can reduce the amount of computation required, since fewer values need to be processed.
Now, let's load the data. We'll use a library called librosa, which can be installed using pip.
import librosa
import librosa.display  # needed for the waveform plot below
data, sampling_rate = librosa.load('audio.wav')
Now, you have both the data and the sampling rate. We can plot the waveform now.
librosa.display.waveplot(data, sr=sampling_rate)  # renamed to librosa.display.waveshow in newer librosa versions
Now you have the audio data in the form of a NumPy array. You can study the features of the data and extract the ones you find interesting to train your models.
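For example, a minimal sketch of extracting log-mel spectrogram and MFCC features with librosa (the file name and parameter values are arbitrary):
import librosa

data, sampling_rate = librosa.load('audio.wav')  # mono float32 array plus its sampling rate

# Log-mel spectrogram: a 2D array of shape (n_mels, n_frames)
mel = librosa.feature.melspectrogram(y=data, sr=sampling_rate, n_mels=64)
log_mel = librosa.power_to_db(mel)

# MFCCs: a compact per-frame representation often used for speech
mfcc = librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13)
print(log_mel.shape, mfcc.shape)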
Further to Ayush's discussion, for information on the challenges and workarounds of dealing with large amounts of audio data at different time scales, I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that, it sounds like you want to do classification. In that case a GAN on its own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN), which is designed to model time series, or you can take fixed-size chunks of input and use a 1-D CNN (similar to WaveNet). If you have lots of unlabelled data from the same or a similar domain and limited training data, you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
Regarding the choice of input, you can use either a spectrogram input (2D) or raw speech signal (1D) as input. For spectrogram input, you have to use a 2D CNN whereas for a raw speech signal you can use a 1D CNN. Mel scale spectrograms are usually preferred over linear spectrograms since our ears hear frequencies in log scale and not linearly.
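A minimal sketch of such a CRNN in PyTorch, assuming a log-mel spectrogram input of shape (batch, 1, n_mels, n_frames) and a categorical emotion output; all layer sizes are illustrative:
import torch
import torch.nn as nn

class SimpleCRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        # CNN front-end: extracts local time-frequency features
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Recurrent part: models temporal dependencies across frames
        self.gru = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=64, batch_first=True)
        # Classification head for categorical emotions
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):             # x: (batch, 1, n_mels, n_frames)
        f = self.cnn(x)               # (batch, 32, n_mels//4, n_frames//4)
        f = f.permute(0, 3, 1, 2)     # (batch, time, channels, freq)
        f = f.flatten(2)              # (batch, time, channels * freq)
        _, h = self.gru(f)            # h: (1, batch, 64)
        return self.fc(h.squeeze(0))  # (batch, n_classes)

logits = SimpleCRNN()(torch.randn(8, 1, 64, 128))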
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.

When to use tensorflow datasets api versus pandas or numpy

There are a number of guides I've seen on using LSTMs for time series in tensorflow, but I am still unsure about the current best practices in terms of reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.
In my situation I have a file data.csv with my features, and would like to do the following two tasks:
Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,
labels[i] = features[i + h, -1] / features[i, -1] - 1
I would like h to be a parameter here, so I can experiment with different horizons.
Get rolling windows - for training purposes, I need to roll my features into windows of length window:
train_features[i] = features[i: i + window]
I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
Edit: I guess I'd also like to know whether the two tasks I listed are suited to the Dataset API, or if I'm better off using other libraries to deal with them.
First off, note that you can use dataset API with pandas or numpy arrays as described in the tutorial:
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices()
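Applied to the two tasks in the question, one possible in-memory pipeline is sketched below: the labels and windows are built in NumPy exactly as defined above and then handed to from_tensor_slices (the horizon, window length, and batch size are arbitrary).
import numpy as np
import pandas as pd
import tensorflow as tf

features = pd.read_csv('data.csv').values.astype(np.float32)  # shape (T, n_features)
h, window = 5, 30  # horizon and window length, both arbitrary here

T = len(features)
n = min(T - h, T - window + 1)  # indices i for which both the label and the window exist
labels = np.array([features[i + h, -1] / features[i, -1] - 1 for i in range(n)], dtype=np.float32)
train_features = np.stack([features[i:i + window] for i in range(n)])

dataset = (tf.data.Dataset
           .from_tensor_slices((train_features, labels))
           .shuffle(1024)
           .batch(32))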
A more interesting question is whether you should organize the data pipeline with the session's feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":
While feeding data using a feed_dict offers a high level of flexibility, in most instances using feed_dict does not scale optimally. However, in instances where only a single GPU is being used the difference can be negligible. Using the Dataset API is still strongly recommended. Try to avoid the following:
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, there is no difference; use any pipeline you feel comfortable with. When speed is important and you have a large training set, the Dataset API seems the better choice, especially if you plan distributed computation.
The Dataset API also works nicely with text data, such as CSV files; check out this section of the dataset tutorial.
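If the CSV is too large to hold in memory, a sketch of streaming it directly with the Dataset API; this assumes the CSV already contains a label column, named 'target' here purely as a placeholder:
import tensorflow as tf

# Streams data.csv from disk in batches instead of loading it all at once.
dataset = tf.data.experimental.make_csv_dataset(
    'data.csv', batch_size=32, label_name='target', num_epochs=1)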

A simple Convolutional neural network code

I am interested in convolutional neural networks (CNNs) as an example of a computationally intensive application that is suitable for acceleration using reconfigurable hardware (i.e., let's say an FPGA).
In order to do that I need to examine a simple CNN code that I can use to understand how they are implemented, how are the computations in each layer taking place, how the output of each layer is being fed to the input of the next one. I am familiar with the theoretical part (http://cs231n.github.io/convolutional-networks/)
But I am not interested in training the CNN; I want a complete, self-contained CNN implementation that is pre-trained, with all the weights and bias values known.
I know that there are plenty of CNN libraries, e.g. Caffe, but the problem is that there is no trivial example code that is self-contained. Even for the simplest Caffe example, "cpp_classification", many libraries are invoked, the architecture of the CNN is expressed as a .prototxt file, and other types of inputs such as .caffemodel and .binaryproto are involved. OpenCV2 libraries are invoked too. There are layers and layers of abstraction and different libraries working together to produce the classification outcome.
I know that those abstractions are needed to generate a "usable" CNN implementation, but for a hardware person who needs bare-bones code to study, this is too much unrelated work.
My question is: Can anyone guide me into a simple and self-contained CNN implementation that I can start with?
I can recommend tiny-cnn. It is simple, lightweight (e.g. header-only) and CPU-only, while providing several layers frequently used in the literature (for example pooling layers, dropout layers, or local response normalization layers). This means that you can easily explore an efficient implementation of these layers in C++ without requiring knowledge of CUDA and without digging through the I/O and framework code required by frameworks such as Caffe. The implementation lacks some comments, but the code is still easy to read and understand.
The provided MNIST example is quite easy to use (tried it myself some time ago) and trains efficiently. After training and testing, the weights are written to file. Then you have a simple pre-trained model from which you can start, see the provided examples/mnist/test.cpp and examples/mnist/train.cpp. It can easily be loaded for testing (or recognizing digits) such that you can debug the code while executing a learned model.
If you want to inspect a more complicated network, have a look at the Cifar-10 Example.
This is the simplest implementation I have seen: DNN McCaffrey
Also, the source code for this by Karpathy looks pretty straightforward.
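If it helps to see the bare arithmetic without any framework, here is a minimal sketch of a single convolutional layer's forward pass in plain NumPy; it only illustrates the computation and is not taken from any of the libraries above.
import numpy as np

def conv2d_forward(x, weights, bias, stride=1):
    # Naive direct convolution (really cross-correlation, as in most CNN libraries).
    # x:       (in_channels, H, W) input feature maps
    # weights: (out_channels, in_channels, kH, kW) pre-trained filters
    # bias:    (out_channels,) pre-trained biases
    in_c, H, W = x.shape
    out_c, _, kH, kW = weights.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_c, out_h, out_w))
    for oc in range(out_c):                  # one output map per filter
        for i in range(out_h):
            for j in range(out_w):
                patch = x[:, i*stride:i*stride+kH, j*stride:j*stride+kW]
                out[oc, i, j] = np.sum(patch * weights[oc]) + bias[oc]
    return np.maximum(out, 0)                # ReLU activation

# Example: 3-channel 8x8 input, four 3x3 filters -> output of shape (4, 6, 6)
y = conv2d_forward(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3), np.random.rand(4))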