The Amazon documentation lists several approaches to evaluate a model (e.g. cross validation, etc.) however these methods does not seem to be available in the Sagemaker Java SDK.
Currently if we want to do 5-fold cross validation it seems the only option is to create 5 models (and also deploy 5 endpoints) one model for each subset of data and manually compute the performance metric (recall, precision, etc.).
This approach is not very efficient and can also be expensive need to deploy k-endpoints, based on the number of folds in the k-fold validation.
Is there another way to test the performance of a model?
Amazon SageMaker is a set of multiple components that you can choose which ones to use.
The built-in algorithms are designed for (infinite) scale, which means that you can have huge datasets and be able to build a model with them quickly and with low cost. Once you have large datasets you usually don't need to use techniques such as cross-validation, and the recommendation is to have a clear split between training data and validation data. Each of these parts will be defined with an input channel when you are submitting a training job.
If you have a small amount of data and you want to train on all of it and use cross-validation to allow it, you can use a different part of the service (interactive notebook instance). You can bring your own algorithm or even container image to be used in the development, training or hosting. You can have any python code based on any machine learning library or framework, including scikit-learn, R, TensorFlow, MXNet etc. In your code, you can define cross-validation based on the training data that you copy from S3 to the worker instances.
Related
I have been experimenting with actor-critic networks such as SAC and TD3 on continuous control tasks and trying to do transfer learning using the trained network to another task with smaller observation and action space.
Would it be possible to do so if i were to save the weights in a dictionary and then load it in the new environment? The inputs to the Actor-Critic network requires a state with different dimensions as well as outputting an actor with different dimensions.
I had some experience doing fine-tuning with transformer models by addind another classifier head and fine-tuning it, but how would i do this with Actor-Critic networks, if the initial layer and final layer does not match with the learned agent.
pretty new to deep learning, but couldn't seem to find/figure out what are backend weights such as
full_yolo_backend.h5
squeezenet_backend.h5
From what I have found and experimented, these backend weights have fundamentally different model architectures such as
yolov2 model has 40+ layers but the backend only 20+ layers (?)
you can build on top of the backend model with your own networks (?)
using backend models tend to yield poorer results (?)
I was hoping to seek some explanation on backend weights vs actual models for learning purposes. Thank you so much!
I'm note sure which implementation you are using but in many applications, you can consider a deep model as a feature extractor whose output is more or less task-agnostic, followed by a number of task-specific heads.
The choice of backend depends on your specific constraints in terms of tradeoff between accuracy and computational complexity. Examples of classical but time-consuming choices for backends are resnet-101, resnet-50 or VGG that can be coupled with FPN (feature pyramid networks) to yield multiscale features. However, if speed is your main concern then you can use smaller backends such as different MobileNet architectures or even the vanilla networks such as the ones used in the original Yolov1/v2 papers (tinyYolo is an extreme case).
Once you have chosen your backend (you can use a pretrained one), you can load its weights (that is what your *h5 files are). On top of that, you will add a small head that will carry the tasks that you need: this can be classification, bbox regression, or like in MaskRCNN forground/background segmentation. For Yolov2, you can just add very few, for example 3 convolutional layers (with non-linearities of course) that will output a tensor of size
BxC1xC2xAxP
#B==batch size
#C1==number vertical of cells
#C2==number of horizontal cells
#C3==number of anchors
#C4==number of parameters (i.e. bbx parameters, class prediction, confidence)
Then, you can just save/load the weights of this head separately. When you are happy with your results though, training jointly (end-to-end) will usually give you a small boost in accuracy.
Finally, to come back to your last questions, I assume that you are getting poor results with the backends because you are only loading backend weights but not the weights of the heads. Another possibility is that you are using a head trained with a backends X but that you are switching the backend to Y. In that case since the head expects different features, it's natural to see a drop in performance.
I'm very new in deep learning, and I'm targeting to use GAN (Generative Adversarial Network) to recognize emotional speech. I've only known images being as inputs to most deep learning algorithms, such as GAN. but I'm curious as to how audio data can be an input into it, besides of using images of the spectrograms as the input. also, i'd appreciate it if you can explain it in laymen terms.
Audio data can be be represented in form of numpy arrays but before moving to that you must understand what audio really is. If you give a thought on what an audio looks like, it is nothing but a wave like format of data, where the amplitude of audio change with respect to time.
Assuming that our audio is represented in time domain, we can extract the values at every half-second(arbitrary). This is called sampling rate.
Converting the data into frequency domain can reduce the amount of computation requires as the sampling rate is less.
Now, let's load the data. We'll use a library called librosa , which can be installed using pip.
data, sampling_rate = librosa.load('audio.wav')
Now, you have both the data and the sampling rate. We can plot the waveform now.
librosa.display.waveplot(data, sr=sampling_rate)
Now, you have the audio data in form of numpy array. You can now study the features of the data and extract the ones you find interesting to train your models.
Further to Ayush’s discussion, for information on the challenges and work arounds of dealing with large amounts of data at different time scales in audio data I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that it sounds like you want to do classification. In that case a GAN on it’s own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN) which is designed to model time series, or you can take set sized chunks of input and use a 1-d CNN (similar to WaveNet). If you have lots of unlabelled data from the same or similar domain and limited training data you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
Regarding the choice of input, you can use either a spectrogram input (2D) or raw speech signal (1D) as input. For spectrogram input, you have to use a 2D CNN whereas for a raw speech signal you can use a 1D CNN. Mel scale spectrograms are usually preferred over linear spectrograms since our ears hear frequencies in log scale and not linearly.
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.
I am working with Neural network for object classification right now. I am working on creating datasets for training and validation. I want to know if it is possible to create two datasets for training comprising of two completely different objects and labels. (EG dataset 1 has cars and dataset 2 has cats) Does it still work or should I create datasets where each file has mixed up both the different object types and labels in all the training files? Does such mixture/separation matter if I am training the network in one cycle with different datasets?
Depending on what you are using to train, many APIs (such as TensorFlow object detection) read the TF Record files (datasets) in order as they are scrambled to make the files beforehand. Scrambling is quite important with training as you will get the model starting training with one class, and then train for a bit with another individual class. It should get to the same standard eventually, but it is a lot better for the model to train with an equal distribution of classes of the training steps.
There are a number of guides I've seen on using LSTMs for time series in tensorflow, but I am still unsure about the current best practices in terms of reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.
In my situation I have a file data.csv with my features, and would like to do the following two tasks:
Compute targets - the target at time t is the percent change of
some column at some horizon, i.e.,
labels[i] = features[i + h, -1] / features[i, -1] - 1
I would like h to be a parameter here, so I can experiment with different horizons.
Get rolling windows - for training purposes, I need to roll my features into windows of length window:
train_features[i] = features[i: i + window]
I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
Edit: I guess that I'd also like to know whether the 2 tasks I listed are suited for the dataset api, or if i'm better off using other libraries to deal with them?
First off, note that you can use dataset API with pandas or numpy arrays as described in the tutorial:
If all of your input data fit in memory, the simplest way to create a
Dataset from them is to convert them to tf.Tensor objects and use
Dataset.from_tensor_slices()
A more interesting question is whether you should organize data pipeline with session feed_dict or via Dataset methods. As already stated in the comments, Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From "Performance Guide":
While feeding data using a feed_dict offers a high level of
flexibility, in most instances using feed_dict does not scale
optimally. However, in instances where only a single GPU is being used
the difference can be negligible. Using the Dataset API is still
strongly recommended. Try to avoid the following:
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When the training speed is not critical, there's no difference, use any pipeline you feel comfortable with. When the speed is important and you have a large training set, the Dataset API seems a better choice, especially you plan distributed computation.
The Dataset API works nicely with text data, such as CSV files, checkout this section of the dataset tutorial.