I want to use Flux to train a Deep Learning model on audio files. In Flux documentation, they passed the whole data array (with all examples) to a dataloader that would feed the train!() function with a list of batches. The point is that I have not enough memory in my system to load all audio files at once.
In PyTorch, the dataloader would be fed by a dataset object that would have the logic to open one file at a time on the __getitem__() method.
So, what is the right way to implement it in Flux/Julia, what is the Torch dataset equivalent?
I found this thread on Julia discourse forum that covers basically what I am asking in this question.
https://discourse.julialang.org/t/pytorch-dataloader-equivalent-for-training-large-models-with-flux/30763
From some recommendations o the topic, there is this one, the package MLDataUtils.jl, that offer similar functionality with the nobs() and getobs() functions.
Related
Does number of parameters and FLOPS (float operations per second) change when convert a model from PyTorch to ONNX or TensorRT format?
I don't think Anvar's post answered OP's question thoroughly so I did a little bit of research. Some general info before the answers to the questions as I believe OP hasn't understood fully what TensorRT and ONNX optimizations happen during the conversion from PyTorch format.
Both conversions, Pytorch to ONNX and ONNX to TensorRT increase the performance of the model by using several different optimizations. The tools actually print you information about what they do if you choose the verbose flag for them.
The preferred way to convert a Pytorch model to TensorRT is to use Torch-TensorRT as explained here.
TensorRT fuses layers and tensors in the model graph, it then uses a large kernel library to select implementations that perform best on the target GPU.
ONNX runtime offers mostly graph optimizations such as graph simplifications and node fusions to improve performance.
1. Does the number of parameters change when converting a PyTorch model to ONNX or TensorRT?
No: even though the layers are fused the number of parameters does not decrease unless there are some redundant branches in the model.
I tested this by downloading the yolov5s.onnx model here. The original model has 7.2M parameters according to the repository authors. Then I used this tool to count the number of parameters in the yolov5.onnx model and got 7225917 as a result. Thus, onnx conversion did not reduce the amount of parameters.
I was not able to get as elaborate information for TensorRT model but you can get layer information using trtexec. There is a recent question about this but there are no answers yet.
2. Does the number of FLOPS change when converting a PyTorch model to ONNX or TensorRT?
According to this post, no.
I know that since some of new versions of Pytorch (I used 1.8 and it worked for me) there are some fusions of batch norm layers and convolutions while saving model. I'm not sure about ONNX, but TensorRT actively uses horizontal and vertical fusion of different layers, so final model would be computational cheaper, than model that you initialized.
I'm working on a project where I need to estimate the 6DOF pose of a known 3D CAD object in a single RGB image - i.e. this task: https://paperswithcode.com/task/6d-pose-estimation. There are several constraints on the problem:
Usable commercially (licensed under BSD, MIT, BOOST, etc.), not GPL.
The CAD object is known and we do NOT aim for generality (i.e.recognize the class of all chairs).
The CAD object can be uploaded by a user, so it may have symmetries and a range of textures.
Inference step will be run on a smartphone, and should be able to run at >30fps.
The inference step can either be a) find the pose of the object once and then I can write code to continue to track it or b) find the pose of the object continuously. I.e. the model doesn't need to have any continuous refinement steps after the initial pose estimate is found.
Can be anywhere on the scale of single instance of a single object to multiple instances of multiple objects (MiMo). MiMO is preferred, but not required.
If a deep learning approach is used, the training time required for a new CAD object should be on the order of hours, not days.
Can either 1) just find the initial pose of an object and not have any refinement steps after or 2) find the initial pose of the object and also have refinement steps after.
I am open to traditional approaches (i.e. 2D->3D correspondences then solving with PnP), but it seems like deep learning approaches outperform them (classical are too slow - Real time 6D pose estimation of known 3D CAD objects from a single 2D image or point clouds from RGBD Camera when objects are one on top of the other?). Looking at deep learning approaches (poseCNN, HybridPose, Pix2Pose, CosyPose), it seems most of them match these constraints, except that they require model training time. Though perhaps I can use a single pre-trained model and then specialize it for each new CAD object with a shorter training step. But I am not sure of this, and I think success probably relies on the specific model chosen. For example, this project says it requires 3 hours of training time: https://github.com/DLR-RM/AugmentedAutoencoder.
So, my question: would somebody know what the state of the art, commercially usable implementation that doesn't require extensive training time for a new CAD object is?
I'm very new in deep learning, and I'm targeting to use GAN (Generative Adversarial Network) to recognize emotional speech. I've only known images being as inputs to most deep learning algorithms, such as GAN. but I'm curious as to how audio data can be an input into it, besides of using images of the spectrograms as the input. also, i'd appreciate it if you can explain it in laymen terms.
Audio data can be be represented in form of numpy arrays but before moving to that you must understand what audio really is. If you give a thought on what an audio looks like, it is nothing but a wave like format of data, where the amplitude of audio change with respect to time.
Assuming that our audio is represented in time domain, we can extract the values at every half-second(arbitrary). This is called sampling rate.
Converting the data into frequency domain can reduce the amount of computation requires as the sampling rate is less.
Now, let's load the data. We'll use a library called librosa , which can be installed using pip.
data, sampling_rate = librosa.load('audio.wav')
Now, you have both the data and the sampling rate. We can plot the waveform now.
librosa.display.waveplot(data, sr=sampling_rate)
Now, you have the audio data in form of numpy array. You can now study the features of the data and extract the ones you find interesting to train your models.
Further to Ayush’s discussion, for information on the challenges and work arounds of dealing with large amounts of data at different time scales in audio data I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that it sounds like you want to do classification. In that case a GAN on it’s own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN) which is designed to model time series, or you can take set sized chunks of input and use a 1-d CNN (similar to WaveNet). If you have lots of unlabelled data from the same or similar domain and limited training data you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
Regarding the choice of input, you can use either a spectrogram input (2D) or raw speech signal (1D) as input. For spectrogram input, you have to use a 2D CNN whereas for a raw speech signal you can use a 1D CNN. Mel scale spectrograms are usually preferred over linear spectrograms since our ears hear frequencies in log scale and not linearly.
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.
I am studying MXNet framework and I need to input a matrix into the network during every iteration. The matrix is stored in external memory, it is not the training data and it is updated by the output of the network at the end of each iteration. During the iteration, the matrix must be input into the network.
If I use high level APIs, i.e.
model = mx.mod.Module(context=ctx, symbol=sym)
... ...
model.fit(train_data_iter, begin_epoch=begin_epoch,
end_epoch=end_epoch, ......)
Can this be implemented?
model.fit() doesn't provide the functionality that you're looking for. However what you want to achieve is extremely easy to do in Gluon API of Apache MXNet. With Gluon API, you write a 7-line code for your training loop, rather than using a single model.fit(). This is a typical training loop code:
for epoch in range(10):
for data, label in train_data:
# forward + backward
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label)
loss.backward()
trainer.step(batch_size) # update parameters
So if you wanted to feed the output of your network back into input, you can easily achieve that. To get started on Gluon, I recommend the 60-minute Gluon Crash Course. To become an expert in Gluon, I recommend Deep Learning - The Straight Dope book as well as a comprehensive set of tutorials on the main MXNet website: http://mxnet.apache.org/tutorials/index.html
There are a number of guides I've seen on using LSTMs for time series in tensorflow, but I am still unsure about the current best practices in terms of reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.
In my situation I have a file data.csv with my features, and would like to do the following two tasks:
Compute targets - the target at time t is the percent change of
some column at some horizon, i.e.,
labels[i] = features[i + h, -1] / features[i, -1] - 1
I would like h to be a parameter here, so I can experiment with different horizons.
Get rolling windows - for training purposes, I need to roll my features into windows of length window:
train_features[i] = features[i: i + window]
I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
Edit: I guess that I'd also like to know whether the 2 tasks I listed are suited for the dataset api, or if i'm better off using other libraries to deal with them?
First off, note that you can use dataset API with pandas or numpy arrays as described in the tutorial:
If all of your input data fit in memory, the simplest way to create a
Dataset from them is to convert them to tf.Tensor objects and use
Dataset.from_tensor_slices()
A more interesting question is whether you should organize data pipeline with session feed_dict or via Dataset methods. As already stated in the comments, Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From "Performance Guide":
While feeding data using a feed_dict offers a high level of
flexibility, in most instances using feed_dict does not scale
optimally. However, in instances where only a single GPU is being used
the difference can be negligible. Using the Dataset API is still
strongly recommended. Try to avoid the following:
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When the training speed is not critical, there's no difference, use any pipeline you feel comfortable with. When the speed is important and you have a large training set, the Dataset API seems a better choice, especially you plan distributed computation.
The Dataset API works nicely with text data, such as CSV files, checkout this section of the dataset tutorial.