I would like to use the efficiency of the transformer architecture to do anomaly detection on time series. I am wondering:
Can we slightly modify the architecture to create a bottleneck in the transformer network (similar to a fully connected autoencoder, or an AE with LSTMs)?
Does it actually make sense to try to do that?
I would like the transformer to learn to reconstruct the input sequence at its output, through some intermediate latent space that has lower dimensionality (the bottleneck).
My idea was to reduce d_model (the number of variables in the time series, or the embedding dimension in NLP), but it must be the same size as the input series according to `torch.nn.Transformer` (see here).
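For illustration, here is a minimal PyTorch sketch of the kind of bottleneck described above: d_model stays fixed inside the transformer layers (as torch.nn.Transformer requires), and the lower-dimensional latent space is created by linear projections around them. All sizes, the layer counts, and the use of encoder-only stacks on both sides are illustrative assumptions, not a tested design.

import torch
import torch.nn as nn

class TransformerAE(nn.Module):
    def __init__(self, d_model=64, latent_dim=8, nhead=4, num_layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.to_latent = nn.Linear(d_model, latent_dim)      # the bottleneck
        self.from_latent = nn.Linear(latent_dim, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        z = self.to_latent(self.encoder(x))     # (batch, seq_len, latent_dim)
        return self.decoder(self.from_latent(z))

model = TransformerAE()
x = torch.randn(16, 100, 64)                    # dummy multivariate time series
recon = model(x)
loss = nn.functional.mse_loss(recon, x)         # reconstruction error as an anomaly score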
I am experimenting with object segmentation (round-shaped objects that often occur close together). I have used the U-Net deep neural network architecture for segmentation and obtained segmentation masks, which I saved in .npy format.
I am a beginner in this area. I would like to know the ideal steps to follow now if I want to apply watershed to the predicted masks, with the aim of separating the objects.
I guess I need to convert the predicted binary mask into some form from which I can obtain markers indicating the object centroids.
Please help
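For reference, a minimal sketch of the usual distance-transform-plus-watershed pipeline with SciPy and scikit-image; the file name, the 0.5 threshold and min_distance are placeholders to adapt to your masks.

import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

mask = np.load('predicted_mask.npy')               # hypothetical file name
mask = (mask > 0.5).astype(np.uint8)               # binarize if the mask holds probabilities

# Distance transform: pixels far from the background peak near object centres
distance = ndi.distance_transform_edt(mask)

# Local maxima of the distance map act as markers (approximate centroids)
coords = peak_local_max(distance, min_distance=10, labels=mask)
markers = np.zeros_like(mask, dtype=np.int32)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)

# Watershed on the inverted distance map, constrained to the mask
labels = watershed(-distance, markers, mask=mask)  # one integer label per separated object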
I'm very new to deep learning, and I'm aiming to use a GAN (Generative Adversarial Network) to recognize emotional speech. I've only seen images used as inputs to most deep learning algorithms such as GANs, but I'm curious how audio data can be fed into them, besides using images of the spectrograms as input. Also, I'd appreciate it if you could explain it in layman's terms.
Audio data can be represented in the form of NumPy arrays, but before moving to that you must understand what audio really is. If you think about what audio looks like, it is nothing but a wave-like form of data, where the amplitude changes with respect to time.
Assuming our audio is represented in the time domain, we sample its amplitude at regular, closely spaced intervals (the exact interval is an arbitrary choice); the number of samples taken per second is called the sampling rate.
Converting the data into the frequency domain can reduce the amount of computation required, since the signal can be described with fewer values than in the raw time domain.
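As a toy illustration of the time/frequency distinction (using a synthetic tone rather than real speech):

import numpy as np

sr = 22050                                    # samples per second (assumed)
t = np.arange(sr) / sr                        # one second of time stamps
signal = np.sin(2 * np.pi * 440 * t)          # a 440 Hz tone in the time domain

spectrum = np.fft.rfft(signal)                # frequency-domain representation
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
peak_freq = freqs[np.argmax(np.abs(spectrum))]   # ≈ 440 Hz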
Now, let's load the data. We'll use a library called librosa, which can be installed using pip.

import librosa

data, sampling_rate = librosa.load('audio.wav')
Now that you have both the data and the sampling rate, we can plot the waveform.
import librosa.display
import matplotlib.pyplot as plt

librosa.display.waveshow(data, sr=sampling_rate)  # called waveplot in older librosa versions
plt.show()
You now have the audio data in the form of a NumPy array. You can study the features of the data and extract the ones you find interesting to train your models.
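For example, MFCCs are a common starting point; the number of coefficients and the time-averaging below are arbitrary choices, not requirements.

import numpy as np
import librosa

data, sampling_rate = librosa.load('audio.wav')                      # placeholder file name
mfccs = librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13)    # shape: (13, n_frames)
mfcc_mean = np.mean(mfccs, axis=1)                                   # fixed-length vector per clip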
Further to Ayush’s discussion, for information on the challenges and workarounds of dealing with large amounts of audio data at different time scales, I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that, it sounds like you want to do classification. In that case a GAN on its own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN), which is designed to model time series, or you can take fixed-size chunks of input and use a 1-D CNN (similar to WaveNet). If you have lots of unlabelled data from the same or a similar domain and limited training data, you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
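As a rough PyTorch sketch of the fixed-size-chunk 1-D CNN option (the chunk length, channel counts and four emotion classes are illustrative assumptions):

import torch
import torch.nn as nn

class Audio1DCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # global average pooling over time
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                        # x: (batch, 1, chunk_len)
        h = self.features(x).squeeze(-1)         # (batch, 32)
        return self.classifier(h)                # class logits

model = Audio1DCNN()
logits = model(torch.randn(8, 1, 16000))         # dummy batch of 1-second chunks at 16 kHz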
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers extract features of varying detail and complexity, whereas the recurrent layers take care of the temporal dependencies. You can then use a final fully connected layer for the regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral, etc.) or continuous (the arousal-valence space).
Regarding the choice of input, you can use either a spectrogram (2D) or the raw speech signal (1D). For spectrogram input you have to use a 2D CNN, whereas for a raw speech signal you can use a 1D CNN. Mel-scale spectrograms are usually preferred over linear spectrograms, since our ears perceive frequency on a log scale rather than linearly.
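A rough PyTorch sketch of such a CRNN over mel-spectrogram input; the layer sizes and number of classes are illustrative assumptions, not the architecture from the papers below.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                        # x: (batch, 1, n_mels, n_frames)
        h = self.cnn(x)                          # (batch, 32, n_mels/4, n_frames/4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, time, features)
        _, h_n = self.gru(h)                     # h_n: (1, batch, 64)
        return self.fc(h_n.squeeze(0))           # class logits

model = CRNN()
logits = model(torch.randn(2, 1, 64, 128))       # dummy mel-spectrogram batch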
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper:
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my GitHub code for the same paper:
https://github.com/smartcameras/ConflictNET
and an SER paper whose code I reproduced in Python:
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally, as Ayush mentioned, librosa is one of the best Python libraries for audio processing, and it has functions to create spectrograms.
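For example, a log-scaled mel spectrogram can be computed roughly like this (the file name and n_mels are placeholders):

import numpy as np
import librosa

y, sr = librosa.load('speech.wav')                       # placeholder file name
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)            # log-scaled, ready for a 2D CNN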
I often see guides and examples using convolutional layers when implementing Deep Q-Networks. This makes sense in some scenarios, typically where you do not have direct access to the state in, for example, an array representation.
In my case, I have a game environment which gives me complete access to the state in the form of a 2D array. This 2D array is later interpreted by a graphics engine and drawn to the screen.
I have been recommended to use convolutional layers for interpreting images, but I have yet to see any recommendations about flattening the 2D state representation directly and using dense layers instead.
Does it make any sense to use convolutional networks/layers for data that is not an image?
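For concreteness, a hedged sketch of the two options mentioned in the question, applied to a small 2D grid state in PyTorch; the grid size, layer widths and number of actions are made-up values.

import torch
import torch.nn as nn

grid_h, grid_w, n_actions = 10, 10, 4

# Option 1: flatten the 2D state and use dense layers
dense_q = nn.Sequential(
    nn.Flatten(),
    nn.Linear(grid_h * grid_w, 128), nn.ReLU(),
    nn.Linear(128, n_actions),
)

# Option 2: treat the 2D state as a 1-channel "image" and use convolutions
conv_q = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * grid_h * grid_w, n_actions),
)

state = torch.randn(1, 1, grid_h, grid_w)       # dummy state batch: (batch, channel, H, W)
q_dense = dense_q(state)                        # (1, n_actions) Q-values
q_conv = conv_q(state)                          # (1, n_actions) Q-values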
Recently I have been learning about the idea of the embedding layer in neural networks. The best explanation I have found so far is here. The explanation there addresses well the core concepts of why to use an embedding layer and how it works.
It also mentions that the embedding will map similar words to a similar region, and thus the quality of the embedding representation is measured by how close a group of representations that are similar in the original space remain in the embedding space. But I really have no idea how to do it.
My question is: how do I design the weight matrix so as to get a better embedding representation that is customised for a specific dataset?
Any hint would be really helpful to me!
Thank you all!
Assuming you know some basic concepts of neural networks and Word2Vec, I will try to explain things briefly.
1. The weight matrix in the embedding layer is often randomly initialized, just like the weights in other types of neural network layers.
2. The weight matrix in the embedding layer transforms the sparse input into a dense vector, as explained in the post you mentioned.
3. The weight matrix in the embedding layer is updated during training on your dataset through backpropagation.
Therefore, after training, the learned weight matrix should give you better representations of your specific data. Just as with word embeddings, more data often yields better representations in the embedding layer. Another factor is the number of dimensions: generally speaking, the higher the dimension, the more degrees of freedom the model has to learn the representations of the features.
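A minimal PyTorch sketch of those three points; the vocabulary size, embedding dimension and token indices are arbitrary.

import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 64
embedding = nn.Embedding(vocab_size, embed_dim)    # weights randomly initialized (point 1)

token_ids = torch.tensor([[3, 17, 256]])           # a batch of word indices (the sparse input)
dense_vectors = embedding(token_ids)               # (1, 3, 64) dense vectors (point 2)

# During training the embedding weights receive gradients like any other layer (point 3)
loss = dense_vectors.sum()                         # dummy loss just to show the gradient flow
loss.backward()
print(embedding.weight.grad.shape)                 # torch.Size([10000, 64])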