How to input audio data into deep learning algorithm? - deep-learning

I'm very new in deep learning, and I'm targeting to use GAN (Generative Adversarial Network) to recognize emotional speech. I've only known images being as inputs to most deep learning algorithms, such as GAN. but I'm curious as to how audio data can be an input into it, besides of using images of the spectrograms as the input. also, i'd appreciate it if you can explain it in laymen terms.

Audio data can be be represented in form of numpy arrays but before moving to that you must understand what audio really is. If you give a thought on what an audio looks like, it is nothing but a wave like format of data, where the amplitude of audio change with respect to time.
Assuming that our audio is represented in time domain, we can extract the values at every half-second(arbitrary). This is called sampling rate.
Converting the data into frequency domain can reduce the amount of computation requires as the sampling rate is less.
Now, let's load the data. We'll use a library called librosa , which can be installed using pip.
data, sampling_rate = librosa.load('audio.wav')
Now, you have both the data and the sampling rate. We can plot the waveform now.
librosa.display.waveplot(data, sr=sampling_rate)
Now, you have the audio data in form of numpy array. You can now study the features of the data and extract the ones you find interesting to train your models.

Further to Ayush’s discussion, for information on the challenges and work arounds of dealing with large amounts of data at different time scales in audio data I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that it sounds like you want to do classification. In that case a GAN on it’s own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN) which is designed to model time series, or you can take set sized chunks of input and use a 1-d CNN (similar to WaveNet). If you have lots of unlabelled data from the same or similar domain and limited training data you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.

Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
Regarding the choice of input, you can use either a spectrogram input (2D) or raw speech signal (1D) as input. For spectrogram input, you have to use a 2D CNN whereas for a raw speech signal you can use a 1D CNN. Mel scale spectrograms are usually preferred over linear spectrograms since our ears hear frequencies in log scale and not linearly.
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.

Related

Why does backprop algorithm store the inputs to the non-linearity of the hidden layers?

I have been reading the Deep Learning book by Ian Goodfellow and it mentions in Section 6.5.7 that
The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
I understand that backprop stores the gradients in a similar fashion to dynamic programming so not to recompute them. But I am confused as to why it stores the input as well?
Backpropagation is a special case of reverse mode automatic differentiation (AD).
In contrast to the forward mode, the reverse mode has the major advantage that you can compute the derivative of an output w.r.t. all inputs of a computation in one pass.
However, the downside is that you need to store all intermediate results of the algorithm you want to differentiate in a suitable data structure (like a graph or a Wengert tape) for as long as you are computing its Jacobian with reverse mode AD, because you're basically "working your way backwards" through the algorithm.
Forward mode AD does not have this disadvantage, but you need to repeat its calculation for every input, so it only makes sense if your algorithm has a lot more output variables than input variables.

Pretrained model or training from scratch for object detection?

I have a dataset composed of 10k-15k pictures for supervised object detection which is very different from Imagenet or Coco (pictures are much darker and represent completely different things, industrial related).
The model currently used is a FasterRCNN which extracts features with a Resnet used as a backbone.
Could train the backbone of the model from scratch in one stage and then train the whole network in another stage be beneficial for the task, instead of loading the network pretrained on Coco and then retraining all the layers of the whole network in a single stage?
From my experience, here are some important points:
your train set is not big enough to train the detector from scratch (though depends on network configuration, fasterrcnn+resnet18 can work). Better to use a pre-trained network on the imagenet;
the domain the network was pre-trained on is not really that important. The network, especially the big one, need to learn all those arches, circles, and other primitive figures in order to use the knowledge for detecting more complex objects;
the brightness of your train images can be important but is not something to stop you from using a pre-trained network;
training from scratch requires much more epochs and much more data. The longer the training is the more complex should be your LR control algorithm. At a minimum, it should not be constant and change the LR based on the cumulative loss. and the initial settings depend on multiple factors, such as network size, augmentations, and the number of epochs;
I played a lot with fasterrcnn+resnet (various number of layers) and the other networks. I recommend you to use maskcnn instead of fasterrcnn. Just command it not to use the masks and not to do the segmentation. I don't know why but it gives much better results.
don't spend your time on mobilenet, with your train set size you will not be able to train it with some reasonable AP and AR. Start with maskrcnn+resnet18 backbone.

Is it theoretically reasonable to use CNN for data like categorical and numeric data?

I'm trying to use CNN to do a binary classification.
As CNN shows its strength in feature extraction, it has been many uses for pattern data like image and voice.
However, the dataset I have is not image or voice data, but categorical data and numerical data, which are different from this case.
My question is as follows.
In this situation, Is it theoretically reasonable to use CNN for data in this configuration?
If it is reasonable, would it be reasonable to artificially place my dataset in a two-dimensional form and perform a 2D-CNN?
I often see examples of using CNN in many classifiers through Kaggle and various media, and I can see not only images and voices, but also numerical and categorical data like mine.
I really wonder this is theoretically a problem, and I would appreciate it if you could recommend it if you knew about the related paper or research.
I'm looking forward to hearing any advice about this situation. Thank you for your answer.
CNNs for images apply kernels to neighboring pixels and blocks of image. CNNs for audio work on spectrograms, i.e. use input data proximity as well.
If your data inputs has some sort of closeness (e.g. time-series, graph...), then CNN might be useful.

Training model to recognize one specific object (or scene)

I am trying to train a learning model to recognize one specific scene. For example, say I would like to train it to recognize pictures taken at an amusement park and I already have 10 thousand pictures taken at an amusement park. I would like to train this model with those pictures so that it would be able to give a score for other pictures of the probability that they were taken at an amusement park. How do I do that?
Considering this is an image recognition problem, I would probably use a convolutional neural network, but I am not quite sure how to train it in this case.
Thanks!
There are several possible ways. The most trivial one is to collect a large number of negative examples (images from other places) and train a two-class model.
The second approach would be to train a network to extract meaningful low-dimensional representations from an input image (embeddings). Here you can use siamese training to explicitly train the network to learn similarities between images. Such an approach is employed for face recognition, for instance (see FaceNet). Having such embeddings, you can use some well-established methods for outlier detections, for instance, 1-class SVM, or any other classifier. In this case you also need negative examples.
I would heavily augment your data using image cropping - it is the most obvious way to increase the amount of training data in your case.
In general, your success in this task strongly depends on the task statement (are restricted to parks only, or any kind of place) and the proper data.

A simple Convolutional neural network code

I am interested in convolutional neural networks (CNNs) as a example of computationally extensive application that is suitable for acceleration using reconfigurable hardware (i.e. lets say FPGA)
In order to do that I need to examine a simple CNN code that I can use to understand how they are implemented, how are the computations in each layer taking place, how the output of each layer is being fed to the input of the next one. I am familiar with the theoretical part (http://cs231n.github.io/convolutional-networks/)
But, I am not interested in training the CNN, I want a complete, self contained CNN code that is pre-trained and all the weights and biases values are known.
I know that there are plenty of CNN libraries, i.e. Caffe, but the problem is that there is no trivial example code that is self contained. even for the simplest Caffe example "cpp_classification" many libraries are invoked, the architecture of the CNN is expressed as .prototxt file, other types of inputs such as .caffemodel and .binaryproto are involved. openCV2 libraries is invoked too. there are layers and layers of abstraction and different libraries working together to produce the classification outcome.
I know that those abstractions are needed to generate a "useable" CNN implementation, but for a hardware person who needs a bare-bone code to study, this is too much of "un-related work".
My question is: Can anyone guide me into a simple and self-contained CNN implementation that I can start with?
I can recommend tiny-cnn. It is simple, lightweight (e.g. header-only) and CPU only, while providing several layers frequently used within the literature (as for example pooling layers, dropout layers or local response normalization layer). This means, that you can easily explore an efficient implementation of these layers in C++ without requiring knowledge of CUDA and digging through the I/O and framework code as required by framework such as Caffe. The implementation lacks some comments, but the code is still easy to read and understand.
The provided MNIST example is quite easy to use (tried it myself some time ago) and trains efficiently. After training and testing, the weights are written to file. Then you have a simple pre-trained model from which you can start, see the provided examples/mnist/test.cpp and examples/mnist/train.cpp. It can easily be loaded for testing (or recognizing digits) such that you can debug the code while executing a learned model.
If you want to inspect a more complicated network, have a look at the Cifar-10 Example.
This is the simplest implementation I have seen: DNN McCaffrey
Also, the source code for this by Karpathy looks pretty straightforward.